Abstract



The affine rank minimization problem consists of finding a matrix of minimum rank that 
satisfies a given system of linear equality constraints. Such problems have appeared in the liter- 
ature of a diverse set of fields including system identification and control, Euclidean embedding, 
and collaborative filtering. Although specific instances can often be solved with specialized al- 
gorithms, the general affine rank minimization problem is NP-hard, because it contains vector 
cardinality minimization as a special case. 

In this paper, we show that if a certain restricted isometry property holds for the linear 
transformation defining the constraints, the minimum rank solution can be recovered by solving 
a convex optimization problem, namely the minimization of the nuclear norm over the given 
affine space. We present several random ensembles of equations where the restricted isometry 
property holds with overwhelming probability, provided the codimension of the subspace is 
$\Omega(r(m + n) \log mn)$, where $m$, $n$ are the dimensions of the matrix, and $r$ is its rank.

The techniques used in our analysis have strong parallels in the compressed sensing frame- 
work. We discuss how affine rank minimization generalizes this pre-existing concept and outline 
a dictionary relating concepts from cardinality minimization to those of rank minimization. We 
also discuss several algorithmic approaches to solving the norm minimization relaxations, and 
illustrate our results with numerical examples. 

Keywords: rank, convex optimization, matrix norms, random matrices, compressed sensing, semidefinite programming.

1 Introduction 

Notions such as order, complexity, or dimensionality can often be expressed by means of the rank of 
an appropriate matrix. For example, a low-rank matrix could correspond to a low-degree statistical 
model for a random process (e.g., factor analysis), a low-order realization of a linear system [28], a 
low-order controller for a plant [22], or a low-dimensional embedding of data in Euclidean space [34].
If the set of feasible models or designs is affine in the matrix variable, choosing the simplest model 
can be cast as an affine rank minimization problem, 



$$
\begin{array}{ll}
\text{minimize} & \operatorname{rank}(X) \\
\text{subject to} & \mathcal{A}(X) = b,
\end{array}
\tag{1.1}
$$

where $X \in \mathbb{R}^{m \times n}$ is the decision variable, and the linear map
$\mathcal{A} : \mathbb{R}^{m \times n} \to \mathbb{R}^p$ and vector $b \in \mathbb{R}^p$
are given. In certain instances with very special structure, the rank minimization problem can
be solved by using the singular value decomposition, or can be exactly reduced to the solution of
linear systems [37, 41]. In general, however, problem (1.1) is a challenging nonconvex optimization
problem for which all known finite time algorithms have at least doubly exponential running times 
in both theory and practice. For the general case, a variety of heuristic algorithms based on local 
optimization, including alternating projections [31] and alternating LMIs [45], have been proposed.

A recent heuristic introduced in [27] minimizes the nuclear norm, or the sum of the singular 
values of the matrix, over the affine subset. The nuclear norm is a convex function, can be optimized 
efficiently, and is the best convex approximation of the rank function over the unit ball of matrices 
with norm less than one. When the matrix variable is symmetric and positive semidefinite, this 
heuristic is equivalent to the trace heuristic often used by the control community (see, e.g., [37]).
The nuclear norm heuristic has been observed to produce very low-rank solutions in practice, but a 
theoretical characterization of when it produces the minimum rank solution has not been previously 
available. This paper provides the first such mathematical characterization. 

Our work is built upon a large body of literature on a related optimization problem. When 
the matrix variable is constrained to be diagonal, the affine rank minimization problem reduces 
to finding the sparsest vector in an affine subspace. This problem is commonly referred to as 
cardinality minimization, since we seek the vector whose support has the smallest cardinality, and 
is known to be NP-hard [39]. For diagonal matrices, the sum of the singular values is equal to the
sum of the absolute values (i.e., the $\ell_1$ norm) of the diagonal elements. Minimization of the $\ell_1$ norm
is a well-known heuristic for the cardinality minimization problem, and stunning results pioneered
by Candès and Tao [10] and Donoho [17] have characterized a vast set of instances for which the
$\ell_1$ heuristic can be a priori guaranteed to yield the optimal solution. These techniques provide the
foundations of the recently developed compressed sensing or compressive sampling frameworks for
measurement, coding, and signal estimation. As has been shown by a number of research groups
(e.g., [H [12] [13] [13]), the $\ell_1$ heuristic for cardinality minimization provably recovers the sparsest
solution whenever the sensing matrix has certain "basis incoherence" properties, and in particular, 
when it is randomly chosen according to certain specific ensembles. 

The fact that the $\ell_1$ heuristic is a special case of the nuclear norm heuristic suggests that these
results from the compressed sensing literature might be extended to provide guarantees about the 
nuclear norm heuristic for the more general rank minimization problem. In this paper, we show 
that this is indeed the case, and the parallels are surprisingly strong. Following the program laid 
out in the work of Candes and Tao, our main contribution is the development of a restricted 
isometry property (RIP), under which the nuclear norm heuristic can be guaranteed to produce 
the minimum-rank solution. Furthermore, as in the case for the $\ell_1$ heuristic, we provide several
specific examples of matrix ensembles for which RIP holds with overwhelming probability. Our 
results considerably extend the compressed sensing machinery in a so far undeveloped direction, 
by allowing a much more general notion of parsimonious models that rely on low-rank assumptions 
instead of cardinality restrictions. 

To make the parallels as clear as possible, we begin by establishing a dictionary between the
matrix rank and nuclear norm minimization problems and the vector sparsity and $\ell_1$ norm problems
in Section 2. In the process of this discussion, we present a review of many useful properties of
the matrices and matrix norms necessary for the main results. We then generalize the notion of
Restricted Isometry to matrices in Section 3 and show that when linear mappings are Restricted
Isometries, recovering low-rank solutions of underdetermined systems can be achieved by nuclear
norm minimization. In Section 4 we present several families of random linear maps that are
restricted isometries with overwhelming probability when the dimensions are sufficiently large.
In Section 5 we briefly discuss three different algorithms designed for solving the nuclear norm
minimization problem and their relative strengths and weaknesses: interior point methods, gradient
projection methods, and a low-rank factorization technique. In Section 6 we demonstrate that in
practice nuclear-norm minimization recovers the lowest rank solutions of affine sets with even fewer
constraints than those guaranteed by our mathematical analysis. Finally, in Section 7 we list a
number of possible directions for future research.

1.1 When are random constraints interesting for rank minimization? 

As in the case of compressed sensing, the conditions we derive to guarantee properties about the 
nuclear norm heuristic are deterministic, but they are at least as difficult to check as solving the 
rank minimization problem itself. We are only able to guarantee that the nuclear norm heuristic 
recovers the minimum rank solution of $\mathcal{A}(X) = b$ when $\mathcal{A}$ is sampled from specific ensembles of
random maps. The constraints appearing in many of the applications mentioned above, such as 
low-order control system design, are typically not random at all and have structured demands 
according to the specifics of the design problem. It thus behooves us to present several examples 
where random constraints manifest themselves in practical scenarios for which no practical solution 
procedure is known. 

Minimum order linear system realization. Rank minimization forms the basis of many model
reduction and low-order system identification problems for linear time-invariant (LTI) systems.
The following example illustrates how random constraints might arise in this context. Consider
the problem of finding the minimum order discrete-time LTI system that is consistent with a set of
time-domain observations. In particular, suppose our observations are the system output sampled
at a fixed time $N$, after a random Gaussian input signal is applied from $t = 0$ to $t = N$. Suppose we
make such measurements for $p$ different input signals, that is, we observe
$y_i(N) = \sum_{t=0}^{N} a_i(N - t)\, h(t)$
for $i = 1, \ldots, p$, where $a_i(t)$, the $i$th input signal, is a zero-mean Gaussian random variable with the
same variance for $t = 0, \ldots, N$, and $h(t)$ denotes the impulse response. We can write this compactly
as $y = Ah$, where $h = [h(0), \ldots, h(N)]'$, and $A_{ij} = a_i(N - j)$.

From linear system theory, the order of the minimal realization for such a system is given by 
the rank of the following Hankel matrix (see, e.g., |29[ H6|) 

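$$
\operatorname{hank}(h) :=
\begin{bmatrix}
h(0) & h(1) & \cdots & h(N) \\
h(1) & h(2) & \cdots & h(N+1) \\
\vdots & \vdots & \ddots & \vdots \\
h(N) & h(N+1) & \cdots & h(2N)
\end{bmatrix}.
$$

Therefore the problem can be expressed as

$$
\begin{array}{ll}
\text{minimize} & \operatorname{rank}(\operatorname{hank}(h)) \\
\text{subject to} & Ah = y,
\end{array}
$$

where the optimization variables are $h(0), \ldots, h(2N)$, and the matrix $A$ consists of i.i.d. zero-mean
Gaussian entries.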
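As a rough illustration of the realization example, the following sketch builds the Hankel matrix from a sequence $h(0), \ldots, h(2N)$ and reads off the system order as its rank. The helper name `hankel_from_impulse`, the particular impulse response, and the sizes are assumptions made for this example only; the random measurements follow the text in using the first $N + 1$ samples of $h$.

```python
import numpy as np


def hankel_from_impulse(h):
    """Return the (N+1) x (N+1) Hankel matrix with entries h(i + j), given h(0), ..., h(2N)."""
    h = np.asarray(h, dtype=float)
    if h.size % 2 == 0:
        raise ValueError("expected an odd number of samples h(0), ..., h(2N)")
    n = (h.size + 1) // 2
    idx = np.arange(n)[:, None] + np.arange(n)[None, :]  # idx[i, j] = i + j
    return h[idx]


# Illustrative impulse response of a second-order system (an assumption, not from the text).
N = 6
t = np.arange(2 * N + 1)
h = 0.9 ** t + (-0.5) ** t

H = hankel_from_impulse(h)
system_order = np.linalg.matrix_rank(H, tol=1e-9)  # rank of hank(h) is the minimal order

# Random measurements y = A h[0:N+1], with i.i.d. Gaussian entries A_ij = a_i(N - j).
rng = np.random.default_rng(0)
p = 4
A = rng.standard_normal((p, N + 1))
y = A @ h[: N + 1]
```

Here the rank is computed from a fully known impulse response; in the rank minimization problem only the measurements `y` would be available.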
Low-Rank Matrix Completion. In the matrix completion problem, where we are given a random
subset of entries of a matrix, we would like to fill in the missing entries such that the resulting
matrix has the lowest possible rank. This problem arises in machine learning scenarios where we
are given partially observed examples of a process with a low-rank covariance matrix and would
like to estimate the missing data. A typical situation where the hidden matrix is low-rank is when
the columns are i.i.d. samples of a random process with low-rank covariance. Such models are
ubiquitous in Factor Analysis, Collaborative Filtering, and Latent Semantic Indexing [42, 47]. In
many of these settings, some prior probability distribution (such as a Bernoulli model or uniform
distribution on subsets) is assumed to generate the set of available entries.

Suppose we are presented with a set of triples $(I(i), J(i), S(i))$ for $i = 1, \ldots, p$ and wish to find
the matrix with $S(i)$ in the entry corresponding to row $I(i)$ and column $J(i)$ for all $i$. The matrix
completion problem seeks to solve

$$
\begin{array}{ll}
\min_{Y} & \operatorname{rank}(Y) \\
\text{s.t.} & Y_{I(i),J(i)} = S(i), \quad i = 1, \ldots, p,
\end{array}
$$

which is a special case of the affine rank minimization problem. 
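To make the "special case" concrete, the entry constraints can be written as a linear map $\mathcal{A}$ that reads off the observed entries, together with its adjoint, which scatters values back into a matrix. The sketch below is only illustrative: the function name `make_sampling_operator` and the problem sizes are assumptions, not part of the paper.

```python
import numpy as np


def make_sampling_operator(rows, cols, shape):
    """Return (A, A_adj) for the entry-sampling map A(Y)_i = Y[rows[i], cols[i]]."""
    rows = np.asarray(rows)
    cols = np.asarray(cols)

    def A(Y):
        return Y[rows, cols]

    def A_adj(z):
        X = np.zeros(shape)
        np.add.at(X, (rows, cols), z)  # place z_i at entry (rows[i], cols[i])
        return X

    return A, A_adj


rng = np.random.default_rng(1)
m, n, r, p = 6, 5, 2, 12
Y0 = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))  # hidden rank-2 matrix

# Observe p entries chosen uniformly at random without replacement.
flat = rng.choice(m * n, size=p, replace=False)
rows, cols = np.unravel_index(flat, (m, n))
A, A_adj = make_sampling_operator(rows, cols, (m, n))
b = A(Y0)  # the constraint A(Y) = b encodes Y[I(i), J(i)] = S(i)
```

With `A` and `b` in hand, matrix completion has exactly the form of problem (1.1).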

Low-dimensional Euclidean embedding problems. A problem that arises in a variety of
fields is the determination of configurations of points in low-dimensional Euclidean spaces, subject 
to some given distance information. In Multi-Dimensional Scaling (MDS), such problems occur in 
extracting the underlying geometric structure of distance data. In psychometrics, the information 
about inter-point distances is usually gathered through a set of experiments where subjects are 
asked to make quantitative (in metric MDS) or qualitative (in non-metric MDS) comparisons of 
objects. In computational chemistry, they come up in inferring the three-dimensional structure of 
a molecule (molecular conformation) from information about interatomic distances [52].

A symmetric matrix $D \in S^n$ is called a Euclidean distance matrix (EDM) if there exist points
$x_1, \ldots, x_n$ in $\mathbb{R}^d$ such that $D_{ij} = \|x_i - x_j\|^2$. Let $V := I_n - \frac{1}{n}\mathbf{1}\mathbf{1}^T$ be the projection matrix onto
the hyperplane $\{v \in \mathbb{R}^n : \mathbf{1}^T v = 0\}$. A classical result by Schoenberg states that $D$ is a Euclidean
distance matrix of $n$ points in $\mathbb{R}^d$ if and only if $D_{ii} = 0$, the matrix $VDV$ is negative semidefinite,
and $\operatorname{rank}(VDV)$ is less than or equal to $d$ [44]. If the matrix $D$ is known exactly, the corresponding
configuration of points (up to a unitary transform) is obtained by simply taking a matrix square
root of $-\frac{1}{2}VDV$. However, in many cases, only a random sample of the distances is
available. The problem of finding a valid EDM consistent with the known inter-point distances and
with the smallest embedding dimension can be expressed as the rank optimization problem

$$
\begin{array}{ll}
\text{minimize} & \operatorname{rank}(VDV) \\
\text{subject to} & VDV \preceq 0 \\
& \mathcal{A}(D) = b,
\end{array}
$$

where $\mathcal{A} : S^n \to \mathbb{R}^p$ is a random sampling operator as discussed in the matrix completion problem.
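The Schoenberg conditions and the matrix square root step can be checked numerically when $D$ is fully known. The following sketch assumes squared distances, as in the definition above; the helper names `edm_from_points` and `embedding_from_edm` and the random test points are illustrative choices, not taken from the paper.

```python
import numpy as np


def edm_from_points(P):
    """Squared Euclidean distance matrix for points stored as the rows of P."""
    G = P @ P.T
    sq = np.diag(G)
    return sq[:, None] + sq[None, :] - 2.0 * G


def embedding_from_edm(D, d):
    """Recover n points in R^d (up to a rigid motion) from a full EDM D."""
    n = D.shape[0]
    V = np.eye(n) - np.ones((n, n)) / n
    G = -0.5 * V @ D @ V              # Gram matrix of the centered points
    w, Q = np.linalg.eigh(G)          # eigenvalues in ascending order
    w, Q = w[::-1][:d], Q[:, ::-1][:, :d]
    return Q * np.sqrt(np.clip(w, 0.0, None))  # a matrix square root of G


rng = np.random.default_rng(2)
n_points, d = 8, 3
P = rng.standard_normal((n_points, d))
D = edm_from_points(P)

V = np.eye(n_points) - np.ones((n_points, n_points)) / n_points
VDV = V @ D @ V
embedding_dim = np.linalg.matrix_rank(VDV, tol=1e-9)
P_rec = embedding_from_edm(D, d)
```

The recovered points `P_rec` generally differ from `P` by a rotation and translation, but they reproduce the same distance matrix.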

This problem involves a Linear Matrix Inequality (LMI) and appears to be more general than
the equality constrained rank minimization problem. However, general LMIs can equivalently be
expressed as rank constraints on an appropriately defined block matrix. The rank of a block
symmetric matrix is equal to the rank of a diagonal block plus the rank of its Schur complement
(see, e.g., [33, §2.2]). Given a function $f$ that maps matrices into $q \times q$ symmetric matrices, the condition that
$f(X)$ is positive semidefinite can be equivalently expressed through a rank constraint as

$$
f(X) \succeq 0 \quad \Longleftrightarrow \quad
\operatorname{rank}\begin{bmatrix} I_q & B \\ B' & f(X) \end{bmatrix} \le q
\quad \text{for some } B \in \mathbb{R}^{q \times q}.
$$

That is, if there exists a matrix $B$ satisfying the inequality above, then $f(X) = B'B \succeq 0$. Using
this equivalent representation allows us to rewrite the embedding problem above as an affine rank
minimization problem.



Image Compression. A simple and well-known method to compress two-dimensional images can
be obtained by using the singular value decomposition (e.g., [3]). The basic idea is to associate to
the given grayscale image a rectangular matrix $M$, with the entries $M_{ij}$ corresponding to the gray
level of the $(i, j)$ pixel. The best rank-$k$ approximation of $M$ is given by

$$
X^* := \arg\min_{\operatorname{rank}(X) \le k} \|M - X\|,
$$

where $\|\cdot\|$ is any unitarily invariant norm. By the classical Eckart-Young-Mirsky theorem ([20, 38]),
the optimal approximant is given by a truncated singular value decomposition of $M$, i.e., if
$M = U \Sigma V^T$, then $X^* = U \Sigma_k V^T$, where the first $k$ diagonal entries of $\Sigma_k$ are the largest $k$ singular values,
and the rest of the entries are zero. If for a given rank $k$, the approximation error $\|M - X^*\|$ is small
enough, then the amount of data needed to encode the information about the image is $k(m + n - k)$
real numbers, which can be much smaller than the $mn$ required to transmit the values of all the
entries.
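The truncated SVD construction is short enough to sketch directly. In the example below, a random matrix stands in for an image, and the names `best_rank_k` and `storage_count` are illustrative; the approximation errors in the Frobenius and operator norms are compared against the discarded singular values.

```python
import numpy as np


def best_rank_k(M, k):
    """Truncated SVD: keep the k largest singular values of M and zero out the rest."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k, :]


def storage_count(m, n, k):
    """Number of real parameters of an m x n matrix of rank k."""
    return k * (m + n - k)


rng = np.random.default_rng(3)
m, n, k = 40, 30, 5
M = rng.standard_normal((m, n))  # stand-in for a grayscale image
Xk = best_rank_k(M, k)

s = np.linalg.svd(M, compute_uv=False)
frobenius_error = np.linalg.norm(M - Xk, "fro")  # equals sqrt(sum of s[k:]**2)
operator_error = np.linalg.norm(M - Xk, 2)       # equals s[k]
```

A random matrix is not well approximated at low rank, so the errors here are large; the point is only that they match the tail of the singular value sequence.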

Consider a given image, whose associated matrix M has low-rank, or can be well-approximated 
by a low-rank matrix. As proposed by Wakin et al. [53] , a single-pixel camera would ideally produce 
measurements that are random linear combinations of all the pixels of the given image. Under this 
situation, the image reconstruction problem boils down exactly to affine rank minimization, where 
the constraints are given by the random linear functionals. 

It should be remarked that the simple SVD image compression scheme described has certain 
deficiencies that more sophisticated techniques do not share (in particular, the lack of invariance 
of the description length under rotations). Nevertheless, due to its simplicity and relatively good 
practical performance, this method is particularly popular in introductory treatments and numerical 
linear algebra textbooks. 



2 From Compressed Sensing to Rank Minimization 

As discussed above, when the matrix variable is constrained to be diagonal, the affine rank
minimization problem (1.1) reduces to the cardinality minimization problem of finding the element in
the affine space with the fewest number of nonzero components. In this section we will establish
a dictionary between the concepts of rank and cardinality minimization. The main elements of
this correspondence are outlined in Table 1. With these elements in place, the existing proofs of
sparsity recovery provide a template for the more general case of low-rank recovery.

In establishing our dictionary, we will provide a review of useful facts regarding matrix norms
and their characterization as convex optimization problems. We will show how computing both
the operator norm and the nuclear norm of a matrix can be cast as semidefinite programming
problems. We also establish the suitable optimality conditions for the minimization of the nuclear
norm under affine equality constraints, the main convex optimization problem studied in this article.
Our discussion of matrix norms will mostly follow the discussion in [27, 53] where extensive lists of
references are provided.

parsimony concept        cardinality          rank
Hilbert space norm       Euclidean            Frobenius
sparsity inducing norm   $\ell_1$             nuclear
dual norm                $\ell_\infty$        operator
norm additivity          disjoint support     orthogonal row and column spaces
convex optimization      linear programming   semidefinite programming

Table 1: A dictionary relating the concepts of cardinality and rank minimization.

Matrix vs. Vector Norms. The three vector norms that play significant roles in the compressed
sensing framework are the $\ell_1$, $\ell_2$, and $\ell_\infty$ norms, denoted by $\|x\|_1$, $\|x\|$ and $\|x\|_\infty$ respectively.
These norms have natural generalizations to matrices, inheriting many appealing properties from
the vector case. In particular, there is a parallel duality structure.

For a rectangular matrix $X \in \mathbb{R}^{m \times n}$, $\sigma_i(X)$ denotes the $i$-th largest singular value of $X$ and
is equal to the square root of the $i$-th largest eigenvalue of $XX'$. The rank of $X$ will usually be
denoted by $r$, and is equal to the number of nonzero singular values. For matrices $X$ and $Y$ of the
same dimensions, we define the inner product in $\mathbb{R}^{m \times n}$ as
$\langle X, Y \rangle := \operatorname{Tr}(X'Y) = \sum_{i=1}^{m} \sum_{j=1}^{n} X_{ij} Y_{ij}$.
The norm associated with this inner product is called the Frobenius (or Hilbert-Schmidt) norm
$\|\cdot\|_F$. The Frobenius norm is also equal to the Euclidean, or $\ell_2$, norm of the vector of singular
values, i.e.,

$$
\|X\|_F := \sqrt{\langle X, X \rangle}
= \left( \sum_{i=1}^{m} \sum_{j=1}^{n} X_{ij}^2 \right)^{1/2}
= \left( \sum_{i=1}^{r} \sigma_i^2 \right)^{1/2}.
$$

The operator norm (or induced 2-norm) of a matrix is equal to its largest singular value (i.e., the
$\ell_\infty$ norm of the singular values):

$$
\|X\| := \sigma_1(X).
$$

The nuclear norm of a matrix is equal to the sum of its singular values, i.e.,

$$
\|X\|_* := \sum_{i=1}^{r} \sigma_i(X),
$$

and is alternatively known by several other names including the Schatten 1-norm, the Ky Fan
$r$-norm, and the trace class norm. Since the singular values are all positive, the nuclear norm is
equal to the $\ell_1$ norm of the vector of singular values. These three norms are related by the following
inequalities which hold for any matrix $X$ of rank at most $r$:

$$
\|X\| \le \|X\|_F \le \|X\|_* \le \sqrt{r}\, \|X\|_F \le r \|X\|. \tag{2.1}
$$
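All three norms can be read off from the singular values, so the chain (2.1) is easy to check numerically. The sketch below does this for one random rank-$r$ matrix; the helper name `matrix_norms` and the dimensions are assumptions made for illustration.

```python
import numpy as np


def matrix_norms(X):
    """Return (operator, Frobenius, nuclear) norms computed from the singular values of X."""
    s = np.linalg.svd(X, compute_uv=False)  # sorted in descending order
    return s[0], np.sqrt(np.sum(s ** 2)), np.sum(s)


rng = np.random.default_rng(4)
m, n, r = 8, 6, 3
X = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))  # rank at most r

op, fro, nuc = matrix_norms(X)
# The chain (2.1): op <= fro <= nuc <= sqrt(r) * fro <= r * op.
chain = [op, fro, nuc, np.sqrt(r) * fro, r * op]
```

The last two inequalities depend on the rank bound: for a full-rank matrix they hold only with $r$ replaced by $\min(m, n)$.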






Dual norms. For any given norm $\|\cdot\|$ in an inner product space, there exists a dual norm $\|\cdot\|_d$
defined as

$$
\|X\|_d := \max\{\langle X, Y \rangle : \|Y\| \le 1\}. \tag{2.2}
$$

Furthermore, the norm dual to the norm $\|\cdot\|_d$ is again the original norm $\|\cdot\|$.

In the case of vector norms in $\mathbb{R}^n$, it is well-known that the dual norm of the $\ell_p$ norm (with
$1 < p < \infty$) is the $\ell_q$ norm, where $\frac{1}{p} + \frac{1}{q} = 1$. This fact is essentially equivalent to Hölder's
inequality. Similarly, the dual norm of the $\ell_\infty$ norm of a vector is the $\ell_1$ norm. These facts also
extend to the matrix norms we have defined. For instance, the dual norm of the Frobenius norm is
the Frobenius norm. This can be verified by simple calculus (or Cauchy-Schwarz), since

$$
\max\{\operatorname{Tr}(X'Y) : \operatorname{Tr}(Y'Y) \le 1\}
$$

is equal to $\|X\|_F$, with the maximizing $Y$ being equal to $X / \|X\|_F$. Similarly, as shown below, the
dual norm of the operator norm is the nuclear norm. The proof of this fact will also allow us to
present variational characterizations of each of these norms as semidefinite programs.



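Proposition 2.1. The dual norm of the operator norm $\|\cdot\|$ in $\mathbb{R}^{m \times n}$ is the nuclear norm $\|\cdot\|_*$.

Proof. First consider an $m \times n$ matrix $Z$. The fact that $Z$ has operator norm at most $t$ can be
expressed as a linear matrix inequality:

$$
\|Z\| \le t
\quad \Longleftrightarrow \quad
t^2 I_m - ZZ' \succeq 0
\quad \Longleftrightarrow \quad
\begin{bmatrix} t I_m & Z \\ Z' & t I_n \end{bmatrix} \succeq 0, \tag{2.3}
$$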



where the last implication follows from a Schur complement argument. As a consequence, we can
give a semidefinite optimization characterization of the operator norm, namely

$$
\|Z\| = \min_{t} \; t
\quad \text{s.t.} \quad
\begin{bmatrix} t I_m & Z \\ Z' & t I_n \end{bmatrix} \succeq 0. \tag{2.4}
$$



Now let $X = U \Sigma V'$ be a singular value decomposition of an $m \times n$ matrix $X$, where $U$ is an
$m \times r$ matrix, $V$ is an $n \times r$ matrix, $\Sigma$ is an $r \times r$ diagonal matrix and $r$ is the rank of $X$. Let
$Y := UV'$. Then $\|Y\| = 1$ and $\operatorname{Tr}(XY') = \sum_{i=1}^{r} \sigma_i(X) = \|X\|_*$, and hence the dual norm is
greater than or equal to the nuclear norm.

To provide an upper bound on the dual norm, we appeal to semidefinite programming duality.
From the characterization in (2.3), the optimization problem

$$
\max\{\langle X, Y \rangle : \|Y\| \le 1\}
$$

is equivalent to the semidefinite program

$$
\max_{Y} \; \operatorname{Tr}(X'Y)
\quad \text{s.t.} \quad
\begin{bmatrix} I_m & Y \\ Y' & I_n \end{bmatrix} \succeq 0. \tag{2.5}
$$
The dual of this SDP (after an inconsequential rescaling) is given by

$$
\min_{W_1, W_2} \; \tfrac{1}{2}\left(\operatorname{Tr}(W_1) + \operatorname{Tr}(W_2)\right)
\quad \text{s.t.} \quad
\begin{bmatrix} W_1 & X \\ X' & W_2 \end{bmatrix} \succeq 0. \tag{2.6}
$$



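Set $W_1 := U \Sigma U'$ and $W_2 := V \Sigma V'$. Then the triple $(W_1, W_2, X)$ is feasible for (2.6) since

$$
\begin{bmatrix} W_1 & X \\ X' & W_2 \end{bmatrix}
= \begin{bmatrix} U \\ V \end{bmatrix} \Sigma \begin{bmatrix} U \\ V \end{bmatrix}' \succeq 0.
$$

Furthermore, we have $\operatorname{Tr}(W_1) = \operatorname{Tr}(W_2) = \operatorname{Tr}(\Sigma)$, and thus the objective function satisfies
$(\operatorname{Tr}(W_1) + \operatorname{Tr}(W_2))/2 = \operatorname{Tr} \Sigma = \|X\|_*$. Since any feasible solution of (2.6) provides an upper bound for (2.5),
we have that the dual norm is less than or equal to the nuclear norm of $X$, thus proving the
proposition. ■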

Notice that the argument given in the proof above further shows that the nuclear norm $\|X\|_*$
can be computed using either the SDP (2.5) or its dual (2.6), since there is no duality gap between
them. Alternatively, this could have also been proven using a Slater-type interior point condition
since both (2.5) and (2.6) admit strictly feasible solutions.
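The two certificates used in the proof can be reproduced numerically: $Y = UV'$ is feasible for (2.5) and attains $\|X\|_*$, while $W_1 = U \Sigma U'$, $W_2 = V \Sigma V'$ are feasible for (2.6) with the same objective value. The sketch below checks this for one random low-rank matrix; it does not call an SDP solver, and the variable names and sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
m, n, r = 7, 5, 2
X = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))

# Thin SVD restricted to the r nonzero singular values.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
U, s, V = U[:, :r], s[:r], Vt[:r, :].T
nuclear = s.sum()

# Primal certificate for (2.5): Y = U V' has operator norm 1 and <X, Y> = ||X||_*.
Y = U @ V.T
lower = np.trace(X @ Y.T)

# Dual certificate for (2.6): W1 = U S U', W2 = V S V'.
W1 = (U * s) @ U.T
W2 = (V * s) @ V.T
block = np.block([[W1, X], [X.T, W2]])
upper = 0.5 * (np.trace(W1) + np.trace(W2))
min_eig = np.linalg.eigvalsh(block).min()  # nonnegative up to rounding
```

Since `lower` and `upper` coincide, both values equal the nuclear norm, mirroring the zero duality gap noted above.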



Convex envelopes of rank and cardinality functions. Let $C$ be a given convex set. The
convex envelope of a (possibly nonconvex) function $f : C \to \mathbb{R}$ is defined as the largest convex
function $g$ such that $g(x) \le f(x)$ for all $x \in C$ (see, e.g., [32]). This means that among all convex
functions, $g$ is the best pointwise approximation to $f$. In particular, if the optimal $g$ can be
conveniently described, it can serve as an approximation to $f$ that can be minimized efficiently.

By the chain of inequalities in (2.1), we have that $\operatorname{rank}(X) \ge \|X\|_* / \|X\|$ for all $X$. For all
matrices with $\|X\| \le 1$, we must have that $\operatorname{rank}(X) \ge \|X\|_*$, so the nuclear norm is a convex lower
bound of the rank function on the unit ball in the operator norm. In fact, it can be shown that
this is the tightest convex lower bound.



Theorem 2.2 ([27]). The convex envelope of $\operatorname{rank}(X)$ on the set $\{X \in \mathbb{R}^{m \times n} : \|X\| \le 1\}$ is the
nuclear norm $\|X\|_*$.

The proof is given in [27] and uses a basic result from convex analysis that establishes that (under
some technical conditions) the biconjugate of a function is its convex envelope [32].



Theorem 2.2 provides the following interpretation of the nuclear norm heuristic for the affine
rank minimization problem. Suppose $X_0$ is the minimum rank solution of $\mathcal{A}(X) = b$, and
$M = \|X_0\|$. The convex envelope of the rank on the set $C = \{X \in \mathbb{R}^{m \times n} : \|X\| \le M\}$ is $\|X\|_* / M$. Let
$X^*$ be the minimum nuclear norm solution of $\mathcal{A}(X) = b$. Then we have

$$
\|X^*\|_* / M \le \operatorname{rank}(X_0) \le \operatorname{rank}(X^*),
$$

providing an upper and lower bound on the optimal rank when the norm of the optimal solution is
known. Furthermore, this is the tightest lower bound among all convex lower bounds of the rank
function on the set $C$.

For vectors, we have a similar inequality. Let $\operatorname{card}(x)$ denote the cardinality function which
counts the number of non-zero entries in the vector $x$. Then we have $\operatorname{card}(x) \ge \|x\|_1 / \|x\|_\infty$.
Not surprisingly, the $\ell_1$ norm is also the convex envelope of the cardinality function over the set
$\{x \in \mathbb{R}^n : \|x\|_\infty \le 1\}$. This result can be either proven directly or can be seen as a special case
of the above theorem.



Additivity of rank and nuclear norm. A function $f$ mapping a linear space $S$ to $\mathbb{R}$ is called
subadditive if $f(x + y) \le f(x) + f(y)$. It is additive if $f(x + y) = f(x) + f(y)$. In the case of
vectors, both the cardinality function and the $\ell_1$ norm are subadditive. That is, if $x$ and $y$ are
sparse vectors, then it always holds that the number of non-zeros in $x + y$ is less than or equal
to the number of non-zeros in $x$ plus the number of non-zeros of $y$; furthermore (by the triangle
inequality) $\|x + y\|_1 \le \|x\|_1 + \|y\|_1$. In particular, the cardinality function is additive exactly when
the vectors $x$ and $y$ have disjoint support. In this case, the $\ell_1$ norm is also additive, in the sense
that $\|x + y\|_1 = \|x\|_1 + \|y\|_1$.

For matrices, the rank function is subadditive. For the rank to be additive, it is necessary and
sufficient that the row and column spaces of the two matrices intersect only at the origin, since in
this case they operate in essentially disjoint spaces (see, e.g., [36]). As we will show below, a related
condition that ensures that the nuclear norm is additive, is that the matrices $A$ and $B$ have row
and column spaces that are orthogonal. In fact, a compact sufficient condition for the additivity
of the nuclear norm will be that $AB' = 0$ and $A'B = 0$. This is a stronger requirement than the
aforementioned condition for rank additivity, as orthogonal subspaces only intersect at the origin.
The disparity arises because the nuclear norm of a linear map depends on the choice of the inner
products on the spaces $\mathbb{R}^m$ and $\mathbb{R}^n$ on which the matrix acts, whereas the rank is independent of
such a choice.

Lemma 2.3. Let $A$ and $B$ be matrices of the same dimensions. If $AB' = 0$ and $A'B = 0$ then
$\|A + B\|_* = \|A\|_* + \|B\|_*$.

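Proof. Partition the singular value decompositions of $A$ and $B$ to reflect the zero and non-zero
singular vectors

$$
A = \begin{bmatrix} U_{A1} & U_{A2} \end{bmatrix}
\begin{bmatrix} \Sigma_A & 0 \\ 0 & 0 \end{bmatrix}
\begin{bmatrix} V_{A1} & V_{A2} \end{bmatrix}',
\qquad
B = \begin{bmatrix} U_{B1} & U_{B2} \end{bmatrix}
\begin{bmatrix} \Sigma_B & 0 \\ 0 & 0 \end{bmatrix}
\begin{bmatrix} V_{B1} & V_{B2} \end{bmatrix}'.
$$

The condition $AB' = 0$ implies that $V_{A1}' V_{B1} = 0$, and similarly, $A'B = 0$ implies that $U_{A1}' U_{B1} = 0$.
Hence, there exist matrices $U_C$ and $V_C$ such that $[U_{A1} \; U_{B1} \; U_C]$ and $[V_{A1} \; V_{B1} \; V_C]$ are orthogonal
matrices. Thus, the following are valid singular value decompositions for $A$ and $B$:

$$
A = \begin{bmatrix} U_{A1} & U_{B1} & U_C \end{bmatrix}
\begin{bmatrix} \Sigma_A & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix}
\begin{bmatrix} V_{A1} & V_{B1} & V_C \end{bmatrix}',
$$

$$
B = \begin{bmatrix} U_{A1} & U_{B1} & U_C \end{bmatrix}
\begin{bmatrix} 0 & 0 & 0 \\ 0 & \Sigma_B & 0 \\ 0 & 0 & 0 \end{bmatrix}
\begin{bmatrix} V_{A1} & V_{B1} & V_C \end{bmatrix}'.
$$

In particular, we have that

$$
A + B = \begin{bmatrix} U_{A1} & U_{B1} \end{bmatrix}
\begin{bmatrix} \Sigma_A & 0 \\ 0 & \Sigma_B \end{bmatrix}
\begin{bmatrix} V_{A1} & V_{B1} \end{bmatrix}'.
$$

This shows that the singular values of $A + B$ are equal to the union (with repetition) of the singular
values of $A$ and $B$. Hence, $\|A + B\|_* = \|A\|_* + \|B\|_*$ as desired. ■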

Corollary 2.4. Let $A$ and $B$ be matrices of the same dimensions. If the row and column spaces of
$A$ and $B$ are orthogonal, then $\|A + B\|_* = \|A\|_* + \|B\|_*$.

Proof. It suffices to show that if the row and column spaces of $A$ and $B$ are orthogonal, then
$AB' = 0$ and $A'B = 0$. But this is immediate: if the columns of $A$ are orthogonal to the columns
of $B$, we have $A'B = 0$. Similarly, orthogonal row spaces imply that $AB' = 0$ as well. ■
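Lemma 2.3 can be illustrated by constructing two matrices whose row and column spaces are mutually orthogonal. In the sketch below the orthonormal bases come from QR factorizations of random matrices; that construction, the helper `nuclear_norm`, and the sizes are assumptions made for the example. A generic third matrix `C` is included to show that additivity fails without the orthogonality condition.

```python
import numpy as np


def nuclear_norm(X):
    """Sum of the singular values of X."""
    return np.linalg.svd(X, compute_uv=False).sum()


rng = np.random.default_rng(6)
m, n, ra, rb = 8, 7, 2, 3

# Orthonormal bases, split so that A and B use disjoint sets of basis vectors.
Qu, _ = np.linalg.qr(rng.standard_normal((m, ra + rb)))
Qv, _ = np.linalg.qr(rng.standard_normal((n, ra + rb)))
A = (Qu[:, :ra] * rng.uniform(1.0, 2.0, ra)) @ Qv[:, :ra].T
B = (Qu[:, ra:] * rng.uniform(1.0, 2.0, rb)) @ Qv[:, ra:].T

additive_gap = nuclear_norm(A) + nuclear_norm(B) - nuclear_norm(A + B)

# A generic matrix C does not satisfy AC' = 0 and A'C = 0, so only subadditivity holds.
C = rng.standard_normal((m, n))
generic_gap = nuclear_norm(A) + nuclear_norm(C) - nuclear_norm(A + C)
```

Here `additive_gap` vanishes up to rounding while `generic_gap` is strictly positive.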



Nuclear norm minimization. Let us turn now to the study of equality-constrained norm
minimization problems where we are searching for a matrix $X \in \mathbb{R}^{m \times n}$ of minimum nuclear norm
belonging to a given affine subspace. In our applications, the subspace is usually described by
linear equations of the form $\mathcal{A}(X) = b$, where $\mathcal{A} : \mathbb{R}^{m \times n} \to \mathbb{R}^p$ is a linear mapping. This problem
admits the primal-dual convex formulation

$$
\begin{array}{llcll}
\min_{X} & \|X\|_* & \qquad & \max_{z} & b'z \\
\text{s.t.} & \mathcal{A}(X) = b, & & \text{s.t.} & \|\mathcal{A}^*(z)\| \le 1,
\end{array}
\tag{2.7}
$$

where $\mathcal{A}^* : \mathbb{R}^p \to \mathbb{R}^{m \times n}$ is the adjoint of $\mathcal{A}$. The formulation (2.7) is valid for any norm minimization
problem, by replacing the norms appearing above by any dual pair of norms. In particular, if we
replace the nuclear norm with the $\ell_1$ norm and the operator norm with the $\ell_\infty$ norm, we obtain a
primal-dual pair of optimization problems that can be reformulated in terms of linear programming.



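Using the SDP characterizations of the nuclear and operator norms given in (2.5)-(2.6) above
allows us to rewrite (2.7) as the primal-dual pair of semidefinite programs

$$
\begin{array}{llcll}
\min_{X, W_1, W_2} & \tfrac{1}{2}\left(\operatorname{Tr}(W_1) + \operatorname{Tr}(W_2)\right) & \qquad & \max_{z} & b'z \\
\text{s.t.} & \begin{bmatrix} W_1 & X \\ X' & W_2 \end{bmatrix} \succeq 0 & & \text{s.t.} & \begin{bmatrix} I_m & \mathcal{A}^*(z) \\ \mathcal{A}^*(z)' & I_n \end{bmatrix} \succeq 0. \\
& \mathcal{A}(X) = b & & &
\end{array}
\tag{2.8}
$$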



Optimality conditions. In order to describe the optimality conditions for the norm minimization
problem (2.7), we must first characterize the subdifferential of the nuclear norm. Recall that for a
convex function $f : \mathbb{R}^n \to \mathbb{R}$, the subdifferential of $f$ at $x \in \mathbb{R}^n$ is the compact convex set

$$
\partial f(x) := \{d \in \mathbb{R}^n : f(y) \ge f(x) + \langle d, y - x \rangle \;\; \forall y \in \mathbb{R}^n\}.
$$

Let $X$ be an $m \times n$ matrix with rank $r$ and let $X = U \Sigma V'$ be a singular value decomposition where
$U \in \mathbb{R}^{m \times r}$, $V \in \mathbb{R}^{n \times r}$ and $\Sigma$ is an $r \times r$ diagonal matrix. The subdifferential of the nuclear norm
at $X$ is then given by (see, e.g., [55])

$$
\partial \|X\|_* = \{UV' + W : W \text{ and } X \text{ have orthogonal row and column spaces, and } \|W\| \le 1\}. \tag{2.9}
$$
For comparison, recall the case of the $\ell_1$ norm, where $T$ denotes the support of the $n$-vector $x$, $T^c$
is the complement of $T$ in the set $\{1, \ldots, n\}$, and

$$
\partial \|x\|_1 = \{d \in \mathbb{R}^n : d_i = \operatorname{sign}(x_i) \text{ for } i \in T, \; |d_i| \le 1 \text{ for } i \in T^c\}. \tag{2.10}
$$

The similarity between (2.9) and (2.10) is particularly transparent if we recall the polar
decomposition of a matrix into a product of orthogonal and positive semidefinite matrices (see, e.g., [33]). The
"angular" component of the matrix $X$ is exactly given by $UV'$. Thus, these subgradients always
have the form of an "angle" (or sign), plus possibly a contraction in an orthogonal direction if the
norm is not differentiable at the current point.

We can now write concise optimality conditions for the optimization problem (2.7). A matrix
$X$ is optimal for (2.7) if there exists a vector $z \in \mathbb{R}^p$ such that

$$
\mathcal{A}(X) = b, \qquad \mathcal{A}^*(z) \in \partial \|X\|_*. \tag{2.11}
$$

The first condition in (2.11) requires feasibility of the linear equations, and the second one
guarantees that there is no feasible direction of improvement. Indeed, since $\mathcal{A}^*(z)$ is in the subdifferential
at $X$, for any $Y$ in the primal feasible set of (2.7) we have

$$
\|Y\|_* \ge \|X\|_* + \langle \mathcal{A}^*(z), Y - X \rangle = \|X\|_* + \langle z, \mathcal{A}(Y - X) \rangle = \|X\|_*,
$$

where the last step follows from the feasibility of $X$ and $Y$. As we can see, the optimality conditions
(2.11) for the nuclear norm minimization problem exactly parallel those of the $\ell_1$ optimization case.

These optimality conditions can be used to check and certify whether a given candidate $X$ is
indeed the minimum nuclear norm solution. For this, it is sufficient (and necessary) to find a vector
$z \in \mathbb{R}^p$ such that $\mathcal{A}^*(z)$ lies in the subdifferential of the norm at $X$, i.e., such that the left- and
right-singular spaces of $\mathcal{A}^*(z)$ are aligned with those of $X$, and $\mathcal{A}^*(z)$ acts as a contraction on the
orthogonal complement.
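The membership test in (2.9) is straightforward to check numerically for a candidate matrix $G$ (which would be $\mathcal{A}^*(z)$ in the certificate above). The following sketch is an illustrative helper, not a solver: the function name `in_nuclear_subdifferential`, the tolerance, and the test matrices are assumptions. It decomposes $G = UV' + W$ and verifies that $W$ is orthogonal to the column and row spaces of $X$ and has operator norm at most one.

```python
import numpy as np


def nuclear_norm(X):
    """Sum of the singular values of X."""
    return np.linalg.svd(X, compute_uv=False).sum()


def in_nuclear_subdifferential(G, X, tol=1e-8):
    """Check whether G = U V' + W with U'W = 0, W V = 0 and ||W|| <= 1, as in (2.9)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    r = int(np.sum(s > tol * s[0]))  # numerical rank of X
    U, V = U[:, :r], Vt[:r, :].T
    W = G - U @ V.T
    orthogonal = np.allclose(U.T @ W, 0.0, atol=tol) and np.allclose(W @ V, 0.0, atol=tol)
    contraction = np.linalg.norm(W, 2) <= 1.0 + tol
    return bool(orthogonal and contraction)


rng = np.random.default_rng(8)
m, n, r = 6, 5, 2
X = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))

# Build W from singular vectors lying outside the column and row spaces of X.
Uf, _, Vft = np.linalg.svd(X, full_matrices=True)
UVt = Uf[:, :r] @ Vft[:r, :]
W = 0.5 * np.outer(Uf[:, -1], Vft[-1, :])

G_good = UVt + W        # a subgradient: W is orthogonal to X and ||W|| = 0.5
G_bad = UVt + 3.0 * W   # not a subgradient: the orthogonal part has norm 1.5
```

A matrix accepted by this check satisfies the subgradient inequality $\|Y\|_* \ge \|X\|_* + \langle G, Y - X \rangle$ for every $Y$, which is what the optimality argument uses.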



3 Restricted Isometry and Recovery of Low-Rank Matrices

Let us now turn to the central problem analyzed in this paper. Let $\mathcal{A} : \mathbb{R}^{m \times n} \to \mathbb{R}^p$ be a linear
map and let $X_0$ be a matrix of rank $r$. Set $b := \mathcal{A}(X_0)$, and define the convex optimization problem

$$
X^* := \arg\min_{X} \|X\|_* \quad \text{s.t.} \quad \mathcal{A}(X) = b. \tag{3.1}
$$

In this section, we will characterize specific cases when we can a priori guarantee that $X^* = X_0$.
The key conditions will be determined by the values of a sequence of parameters $\delta_r$ that quantify
the behavior of the linear map $\mathcal{A}$ when restricted to the subvariety of matrices of rank $r$. The
following definition is the natural generalization of the Restricted Isometry Property from vectors
to matrices.

Definition 3.1. Let $\mathcal{A} : \mathbb{R}^{m \times n} \to \mathbb{R}^p$ be a linear map. Without loss of generality, assume $m \le n$.
For every integer $r$ with $1 \le r \le m$, define the $r$-restricted isometry constant to be the smallest
number $\delta_r(\mathcal{A})$ such that

$$
(1 - \delta_r(\mathcal{A})) \|X\|_F \le \|\mathcal{A}(X)\| \le (1 + \delta_r(\mathcal{A})) \|X\|_F \tag{3.2}
$$

holds for all matrices $X$ of rank at most $r$.
Note that by definition, $\delta_r(\mathcal{A}) \le \delta_{r'}(\mathcal{A})$ for $r \le r'$.

The Restricted Isometry Property for sparse vectors was developed by Candes and Tao in [14],
and requires (3.2) to hold with the Euclidean norm replacing the Frobenius norm and rank being
replaced by cardinality. Since for diagonal matrices, the Frobenius norm is equal to the Euclidean
norm of the diagonal, this definition reduces to the original Restricted Isometry Property of [2] in
the diagonal case (see footnote 1).

Unlike the case of "standard" compressed sensing, our RIP condition for low-rank matrices
cannot be interpreted as guaranteeing all sub-matrices of the linear transform $\mathcal{A}$ of a certain size
are well conditioned. Indeed, the set of matrices $X$ for which (3.2) must hold is not a finite union
of subspaces, but rather a certain "generalized Stiefel manifold," which is also an algebraic variety
(in fact, it is the $r$th-secant variety of the variety of rank-one matrices). Surprisingly, we are still
able to derive analogous recovery results for low-rank solutions of equations when $\mathcal{A}$ obeys this RIP
condition. Furthermore, we will see in Section 4 that many ensembles of random matrices have
the Restricted Isometry Property with $\delta_r$ quite small with high probability for reasonable values of
$m$, $n$, and $p$.

The following two recovery theorems will characterize the power of the restricted isometry 
constants. Both theorems are more or less immediate generalizations from the sparse case to the 
low-rank case and use only minimal properties of the rank of matrices and the nuclear norm. The 
first theorem generalizes Lemma 1.3 in [TJ] to low-rank recovery. 

Theorem 3.2 Suppose that $\delta_{2r} < 1$ for some integer $r \ge 1$. Then $X_0$ is the only matrix of rank
at most $r$ satisfying $\mathcal{A}(X) = b$.

Proof Assume, on the contrary, that there exists a rank $r$ matrix $X$ satisfying $\mathcal{A}(X) = b$ and
$X \ne X_0$. Then $Z := X_0 - X$ is a nonzero matrix of rank at most $2r$, and $\mathcal{A}(Z) = 0$. But then we
would have $0 = \|\mathcal{A}(Z)\| \ge (1 - \delta_{2r})\|Z\|_F > 0$, which is a contradiction. ■

The proof of the preceding theorem is identical to the argument given by Candes and Tao and
is an immediate consequence of our definition of the constant $\delta_r$. No adjustment is necessary in the
transition from sparse vectors to low-rank matrices. The key property used is the sub-additivity of
the rank.

Next, we state a weak $\ell_1$-type recovery theorem whose proof mimics the approach in [12], but
for which a few details need to be adjusted when switching from vectors to matrices.

Theorem 3.3 Suppose that $r \ge 1$ is such that $\delta_{5r} < 1/10$. Then $X^* = X_0$.

We will need the following technical lemma that shows for any two matrices $A$ and $B$, we can
decompose $B$ as the sum of two matrices $B_1$ and $B_2$ such that $\mathrm{rank}(B_1)$ is not too large and $B_2$
satisfies the conditions of Lemma 2.3. This will be the key decomposition for proving Theorem 3.3.

Lemma 3.4 Let $A$ and $B$ be matrices of the same dimensions. Then there exist matrices $B_1$ and
$B_2$ such that

1. $B = B_1 + B_2$

2. $\mathrm{rank}(B_1) \le 2\,\mathrm{rank}(A)$

3. $A B_2' = 0$ and $A' B_2 = 0$

4. $\langle B_1, B_2 \rangle = 0$

Footnote 1: In [14], the authors define the restricted isometry properties with squared norms. We note here that the analysis
is identical modulo some algebraic rescaling of constants. We choose to drop the squares as it greatly simplifies the
analysis in Section 4.

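Proof Consider a full singular value decomposition of $A$,

$$A = U \begin{bmatrix} \Sigma & 0 \\ 0 & 0 \end{bmatrix} V',$$

and let $\hat{B} := U' B V$. Partition $\hat{B}$ as

$$\hat{B} = \begin{bmatrix} \hat{B}_{11} & \hat{B}_{12} \\ \hat{B}_{21} & \hat{B}_{22} \end{bmatrix}.$$

Defining now

$$B_1 := U \begin{bmatrix} \hat{B}_{11} & \hat{B}_{12} \\ \hat{B}_{21} & 0 \end{bmatrix} V', \qquad B_2 := U \begin{bmatrix} 0 & 0 \\ 0 & \hat{B}_{22} \end{bmatrix} V',$$

it can be easily verified that $B_1$ and $B_2$ satisfy the conditions (1)-(4).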
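The construction in this proof is easy to check numerically. The following sketch is an illustration under stated assumptions rather than part of the original argument: the helper name `split_against`, the rank tolerance, and the test sizes are hypothetical choices. It builds $B_1$ and $B_2$ from a full SVD of $A$ and verifies the four properties of Lemma 3.4.

```python
import numpy as np

def split_against(A, B, tol=1e-10):
    """Split B = B1 + B2 as in Lemma 3.4, using a full SVD of A."""
    U, s, Vt = np.linalg.svd(A, full_matrices=True)
    r = int(np.sum(s > tol))          # numerical rank of A
    B_hat = U.T @ B @ Vt.T            # B_hat = U' B V
    B_hat2 = np.zeros_like(B_hat)
    B_hat2[r:, r:] = B_hat[r:, r:]    # keep only the (2,2) block
    B2 = U @ B_hat2 @ Vt
    B1 = B - B2                       # the remaining three blocks
    return B1, B2

rng = np.random.default_rng(0)
A = rng.normal(size=(8, 2)) @ rng.normal(size=(2, 7))   # rank-2 example
B = rng.normal(size=(8, 7))
B1, B2 = split_against(A, B)
print(np.linalg.matrix_rank(B1, tol=1e-8))               # at most 2 * rank(A)
print(np.linalg.norm(A @ B2.T), np.linalg.norm(A.T @ B2), np.sum(B1 * B2))
```

The last line prints quantities that should all be zero up to rounding error.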



We now proceed to a proof of Theorem 3.3.

Proof [of Theorem 3.3] By optimality of $X^*$, we have $\|X_0\|_* \ge \|X^*\|_*$. Let $R := X^* - X_0$.
Applying Lemma 3.4 to the matrices $X_0$ and $R$, there exist matrices $R_0$ and $R_c$ such that $R = R_0 + R_c$, $\mathrm{rank}(R_0) \le 2\,\mathrm{rank}(X_0)$, and $X_0 R_c' = 0$ and $X_0' R_c = 0$. Then,

$$\|X_0\|_* \ge \|X_0 + R\|_* \ge \|X_0 + R_c\|_* - \|R_0\|_* = \|X_0\|_* + \|R_c\|_* - \|R_0\|_*, \qquad (3.3)$$

where the middle assertion follows from the triangle inequality and the last one from Lemma 2.3.
Rearranging terms, we can conclude that

$$\|R_0\|_* \ge \|R_c\|_*. \qquad (3.4)$$



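Next we partition $R_c$ into a sum of matrices $R_1, R_2, \ldots$, each of rank at most $3r$. Let $R_c = U\,\mathrm{diag}(\sigma)\,V'$ be the singular value decomposition of $R_c$. For each $i \ge 1$ define the index set
$I_i = \{3r(i-1)+1, \ldots, 3ri\}$, and let $R_i := U_{I_i}\,\mathrm{diag}(\sigma_{I_i})\,V_{I_i}'$ (notice that $\langle R_i, R_j \rangle = 0$ if $i \ne j$). By
construction, we have

$$\sigma_k \le \frac{1}{3r} \sum_{j \in I_i} \sigma_j \qquad \forall\, k \in I_{i+1}, \qquad (3.5)$$

which implies $\|R_{i+1}\|_F^2 \le \frac{1}{3r}\|R_i\|_*^2$. We can then compute the following bound

$$\sum_{j \ge 2} \|R_j\|_F \le \frac{1}{\sqrt{3r}} \sum_{j \ge 1} \|R_j\|_* = \frac{1}{\sqrt{3r}} \|R_c\|_* \le \frac{1}{\sqrt{3r}} \|R_0\|_* \le \frac{\sqrt{2r}}{\sqrt{3r}} \|R_0\|_F, \qquad (3.6)$$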
where the last inequality follows from (2.1) and the fact that $\mathrm{rank}(R_0) \le 2r$. Finally, note that the rank of $R_0 + R_1$ is at most $5r$, so we may put this all together as

$$\begin{aligned}
\|\mathcal{A}(R)\| &\ge \|\mathcal{A}(R_0 + R_1)\| - \sum_{j \ge 2} \|\mathcal{A}(R_j)\| \\
&\ge (1 - \delta_{5r})\|R_0 + R_1\|_F - (1 + \delta_{3r}) \sum_{j \ge 2} \|R_j\|_F \\
&\ge \left( (1 - \delta_{5r}) - \sqrt{\tfrac{2}{3}}\,(1 + \delta_{3r}) \right) \|R_0\|_F \\
&\ge \left( (1 - \delta_{5r}) - \tfrac{9}{11}(1 + \delta_{3r}) \right) \|R_0\|_F.
\end{aligned} \qquad (3.7)$$

By assumption $\mathcal{A}(R) = \mathcal{A}(X^* - X_0) = 0$, so if the factor on the right-hand side is strictly positive,
$R_0 = 0$, which further implies $R_c = 0$ by (3.4), and thus $X^* = X_0$. Simple algebra reveals that
the right-hand side is positive when $9\delta_{3r} + 11\delta_{5r} < 2$. Since $\delta_{3r} \le \delta_{5r}$, we immediately have that
$X^* = X_0$ if $\delta_{5r} < 1/10$. ■
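For completeness, the "simple algebra" at the end of the proof can be written out as follows. This is a worked restatement of the step in the text, using only the constants that appear in (3.7).

```latex
\begin{align*}
(1-\delta_{5r}) - \tfrac{9}{11}(1+\delta_{3r}) > 0
  &\iff 11(1-\delta_{5r}) - 9(1+\delta_{3r}) > 0 \\
  &\iff 9\,\delta_{3r} + 11\,\delta_{5r} < 2 .
\end{align*}
% Since \delta_{3r} \le \delta_{5r}, the left-hand side is at most 20\,\delta_{5r},
% so \delta_{5r} < 1/10 is sufficient.
```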

The rational number (9/11) in the proof of the theorem is chosen for notational simplicity and
is clearly not optimal. A slightly tighter bound can be achieved working directly with the second
to last line in (3.7). The most important point is that our recovery condition on $\delta_{5r}$ is an absolute
constant, independent of $m$, $n$, $r$, and $p$.

We have yet to demonstrate any specific linear mappings $\mathcal{A}$ for which $\delta_r < 1$. We shall show in
the next section that linear transformations sampled from several families of random matrices with
appropriately chosen dimensions have this property with overwhelming probability. The analysis is
again similar to the compressive sampling literature, but several details specific to the rank recovery
problem need to be employed.

4 Nearly Isometric Families 

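In this section, we will demonstrate that when we sample linear maps from a class of probability
distributions obeying certain tail bounds, then they will obey the Restricted Isometry Property
(3.2) as $p$, $m$, and $n$ tend to infinity at appropriate rates. The following definition characterizes
this family of random linear transformations.

Definition 4.1 Let $\mathcal{A}$ be a random variable that takes values in linear maps from $\mathbb{R}^{m \times n}$ to $\mathbb{R}^p$.
We say that $\mathcal{A}$ is nearly isometrically distributed if for all $X \in \mathbb{R}^{m \times n}$

$$\mathbb{E}\left[\|\mathcal{A}(X)\|^2\right] = \|X\|_F^2 \qquad (4.1)$$

and for all $0 < \epsilon < 1$ we have

$$\mathbb{P}\left( \left| \|\mathcal{A}(X)\|^2 - \|X\|_F^2 \right| \ge \epsilon \|X\|_F^2 \right) \le 2 \exp\left( -\frac{p}{2}\left( \epsilon^2/2 - \epsilon^3/3 \right) \right) \qquad (4.2)$$

and for all $t > 0$, we have

$$\mathbb{P}\left( \|\mathcal{A}\| \ge 1 + \sqrt{\frac{mn}{p}} + t \right) \le \exp(-\gamma p t^2) \qquad (4.3)$$

for some constant $\gamma > 0$.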
There are two ingredients for a random linear map to be nearly isometric. First, it must be
isometric in expectation. Second, the probability of large distortions of length must be exponentially
small. The exponential bound in (4.2) guarantees union bounds will be small even for rather large
sets. This concentration is the typical ingredient required to prove the Johnson-Lindenstrauss
Lemma (cf. [2JQJ)]).

The majority of nearly isometric random maps are described in terms of random matrices. For
a linear map $\mathcal{A} : \mathbb{R}^{m \times n} \to \mathbb{R}^p$, we can always write its matrix representation as

$$\mathcal{A}(X) = A\,\mathrm{vec}(X), \qquad (4.4)$$

where $\mathrm{vec}(X)$ denotes the vector of $X$ with its columns stacked in order on top of one another,
and $A$ is a $p \times mn$ matrix. We now give several examples of nearly isometric random variables
in this matrix representation. The most well known is the ensemble with independent, identically
distributed (i.i.d.) Gaussian entries [15]

$$A_{ij} \sim \mathcal{N}\left(0, \frac{1}{p}\right). \qquad (4.5)$$

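We also mention the two following ensembles of matrices, described in [2]. One has entries sampled
from an i.i.d. symmetric Bernoulli distribution

$$A_{ij} = \begin{cases} \sqrt{1/p} & \text{with probability } 1/2 \\ -\sqrt{1/p} & \text{with probability } 1/2 \end{cases} \qquad (4.6)$$

and the other has zeros in two-thirds of the entries

$$A_{ij} = \begin{cases} \sqrt{3/p} & \text{with probability } 1/6 \\ 0 & \text{with probability } 2/3 \\ -\sqrt{3/p} & \text{with probability } 1/6 \end{cases} \qquad (4.7)$$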



The fact that the top singular value of the matrix $A$ is concentrated around $1 + \sqrt{D/p}$ for all of
these ensembles, where $D = mn$ is the number of columns of $A$, follows from the work of Yin, Bai, and Krishnaiah, who showed that whenever the
entries $A_{ij}$ are i.i.d. with zero mean and finite fourth moment, then the maximum singular value
of $A$ is almost surely $1 + \sqrt{D/p}$ for $D$ sufficiently large [56]. El Karoui uses this result to prove
the concentration inequality (4.3) for all such distributions [23]. The result for Gaussians is rather
tight with $\gamma = 1/2$ (see, e.g., [16]).

Finally, note that a random projection also obeys all of the necessary concentration inequalities. Indeed, since the norm of a random projection is exactly $\sqrt{D/p}$, (4.3) holds trivially. The
concentration inequality (4.2) is proven in [15].
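As an illustration of the three entrywise ensembles above, the following sketch samples each one and checks empirically that $\|A\,\mathrm{vec}(X)\|^2$ averages to $\|X\|_F^2$, as required by (4.1). The function names `sample_ensemble` and `apply_map`, the seed, and the sizes are hypothetical choices made for the example; the Bernoulli and sparse values assume entries of variance $1/p$.

```python
import numpy as np

def sample_ensemble(kind, p, D, rng):
    """Sample a p x D matrix from one of the ensembles (4.5)-(4.7)."""
    if kind == "gaussian":      # i.i.d. N(0, 1/p) entries
        return rng.normal(0.0, np.sqrt(1.0 / p), size=(p, D))
    if kind == "bernoulli":     # +/- sqrt(1/p), each with probability 1/2
        return rng.choice([-1.0, 1.0], size=(p, D)) / np.sqrt(p)
    if kind == "sparse":        # +/- sqrt(3/p) w.p. 1/6 each, 0 w.p. 2/3
        signs = rng.choice([-1.0, 0.0, 1.0], size=(p, D), p=[1 / 6, 2 / 3, 1 / 6])
        return signs * np.sqrt(3.0 / p)
    raise ValueError(f"unknown ensemble: {kind}")

def apply_map(A, X):
    """Apply the linear map A(X) = A vec(X), with vec stacking columns."""
    return A @ X.flatten(order="F")

rng = np.random.default_rng(0)
p, m, n = 200, 5, 4
X = rng.normal(size=(m, n))
for kind in ("gaussian", "bernoulli", "sparse"):
    ratios = [np.sum(apply_map(sample_ensemble(kind, p, m * n, rng), X) ** 2) / np.sum(X ** 2)
              for _ in range(50)]
    print(kind, np.mean(ratios))   # each average should be close to 1
```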



The main result of this section is the following: 

Theorem 4.2 Fix $0 < \delta < 1$. If $\mathcal{A}$ is a nearly isometric random variable, then for every $1 \le r \le m$,
there exist constants $c_0, c_1 > 0$ depending only on $\delta$ such that, with probability at least $1 - \exp(-c_1 p)$,
$\delta_r(\mathcal{A}) \le \delta$ whenever $p \ge c_0\, r(m + n) \log(mn)$.
The proof will make use of standard techniques in concentration of measure. We first extend
the concentration results of [4] to subspaces of matrices. We will show that the distortion of a
subspace by a linear map is robust to perturbations of the subspace. Finally, we will provide an
epsilon net over the set of all subspaces and, using a union bound, will show that with overwhelming
probability, nearly isometric random variables will obey the Restricted Isometry Property (3.2) as
the size of the matrices tends to infinity.

The following lemma characterizes the behavior of a nearly isometric random mapping $\mathcal{A}$ when
restricted to an arbitrary subspace of matrices $U$ of dimension $d$.

Lemma 4.3 Let $\mathcal{A}$ be a nearly isometric linear map and let $U$ be an arbitrary subspace of $m \times n$
matrices with $d = \dim(U) \le p$. Then for any $0 < \delta < 1$ we have

$$(1 - \delta)\|X\|_F \le \|\mathcal{A}(X)\| \le (1 + \delta)\|X\|_F \qquad \forall\, X \in U \qquad (4.8)$$

with probability at least

$$1 - 2(12/\delta)^d \exp\left( -\frac{p}{2}\left( \delta^2/8 - \delta^3/24 \right) \right). \qquad (4.9)$$

Proof The proof of this lemma is identical to the argument in [4] where the authors restricted
their attention to subspaces aligned with the coordinate axes. We will sketch the proof here as the
argument is straightforward.

There exists a finite set $\Omega$ of at most $(12/\delta)^d$ points such that for every $X \in U$ with $\|X\|_F \le 1$,
there exists a $Q \in \Omega$ such that $\|X - Q\|_F \le \delta/4$. By the standard union bound, the concentration
inequality (4.2) holds for all $Q \in \Omega$ with $\epsilon = \delta/2$ with probability at least (4.9). If (4.2) holds for
all $Q \in \Omega$, then we immediately have that $(1 - \delta/2)\|Q\|_F \le \|\mathcal{A}(Q)\| \le (1 + \delta/2)\|Q\|_F$ for all $Q \in \Omega$
as well.

Let $X$ be in $\{X \in U : \|X\|_F \le 1\}$ and $M$ be the maximum of $\|\mathcal{A}(X)\|$ on this set. Then there
exists a $Q \in \Omega$ such that $\|X - Q\|_F \le \delta/4$. We then have

$$\|\mathcal{A}(X)\| \le \|\mathcal{A}(Q)\| + \|\mathcal{A}(X - Q)\| \le 1 + \delta/2 + M\delta/4,$$

and since $M \le 1 + \delta/2 + M\delta/4$ by definition, we have $M \le (1 + \delta/2)/(1 - \delta/4) \le 1 + \delta$ for $0 < \delta < 1$. The lower bound is proven by
the following chain of inequalities

$$\|\mathcal{A}(X)\| \ge \|\mathcal{A}(Q)\| - \|\mathcal{A}(X - Q)\| \ge 1 - \delta/2 - (1 + \delta)\delta/4 \ge 1 - \delta.$$



The proof of the preceding lemma revealed that the near isometry of a linear map is robust to small
perturbations of the matrix on which the map is acting. We will now show that this behavior is
robust with respect to small perturbations of the subspace $U$ as well. This perturbation will be
measured in the natural distance between two subspaces

$$\rho(T_1, T_2) := \|P_{T_1} - P_{T_2}\|, \qquad (4.10)$$

where $T_1$ and $T_2$ are subspaces and $P_{T_i}$ is the orthogonal projection associated with each subspace.
This distance measures the operator norm of the difference between the corresponding projections,
and is equal to the sine of the largest principal angle between $T_1$ and $T_2$ [1].
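The distance (4.10) is straightforward to compute for small examples. The sketch below is illustrative only; the function names and the assumption that each subspace is given by a basis matrix with full column rank are mine. It compares the projection-difference norm with the sine of the largest principal angle for two subspaces of equal dimension.

```python
import numpy as np

def projector(B):
    """Orthogonal projector onto the column span of B (B has full column rank)."""
    Q, _ = np.linalg.qr(B)
    return Q @ Q.T

def subspace_distance(B1, B2):
    """rho(T1, T2) = || P_T1 - P_T2 || in the operator norm, as in (4.10)."""
    return np.linalg.norm(projector(B1) - projector(B2), 2)

def largest_principal_angle_sine(B1, B2):
    """Sine of the largest principal angle between span(B1) and span(B2)."""
    Q1, _ = np.linalg.qr(B1)
    Q2, _ = np.linalg.qr(B2)
    cosines = np.linalg.svd(Q1.T @ Q2, compute_uv=False)
    c = min(float(cosines.min()), 1.0)     # guard against rounding above 1
    return np.sqrt(1.0 - c ** 2)

theta = 0.3
line1 = np.array([[1.0], [0.0]])
line2 = np.array([[np.cos(theta)], [np.sin(theta)]])
print(subspace_distance(line1, line2), np.sin(theta))   # the two values should agree
```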
The set of all $d$-dimensional subspaces of $\mathbb{R}^D$ is commonly known as the Grassmannian manifold
$\mathfrak{G}(D, d)$. We will endow it with the metric $\rho(\cdot, \cdot)$ given by (4.10), also known as the projection 2-norm. In the following lemma we characterize and quantify the change in the isometry constant $\delta$
as one smoothly moves through the Grassmannian.

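Lemma 4.4 Let $U_1$ and $U_2$ be $d$-dimensional subspaces of $\mathbb{R}^D$. Suppose that for all $X \in U_1$,

$$(1 - \delta)\|X\|_F \le \|\mathcal{A}(X)\| \le (1 + \delta)\|X\|_F \qquad (4.11)$$

for some constant $0 < \delta < 1$. Then for all $Y \in U_2$

$$(1 - \delta')\|Y\|_F \le \|\mathcal{A}(Y)\| \le (1 + \delta')\|Y\|_F \qquad (4.12)$$

with

$$\delta' = \delta + (1 + \|\mathcal{A}\|) \cdot \rho(U_1, U_2). \qquad (4.13)$$

Proof Consider any $Y \in U_2$. Then

$$\begin{aligned}
\|\mathcal{A}(Y)\| &= \|\mathcal{A}(P_{U_1}(Y) - [P_{U_1} - P_{U_2}](Y))\| \\
&\le \|\mathcal{A}(P_{U_1}(Y))\| + \|\mathcal{A}([P_{U_1} - P_{U_2}](Y))\| \\
&\le (1 + \delta)\|P_{U_1}(Y)\|_F + \|\mathcal{A}\|\,\|P_{U_1} - P_{U_2}\|\,\|Y\|_F \\
&\le (1 + \delta + \|\mathcal{A}\|\,\|P_{U_1} - P_{U_2}\|)\,\|Y\|_F.
\end{aligned} \qquad (4.14)$$

Similarly, we have

$$\begin{aligned}
\|\mathcal{A}(Y)\| &\ge \|\mathcal{A}(P_{U_1}(Y))\| - \|\mathcal{A}([P_{U_1} - P_{U_2}](Y))\| \\
&\ge (1 - \delta)\|P_{U_1}(Y)\|_F - \|\mathcal{A}\|\,\|P_{U_1} - P_{U_2}\|\,\|Y\|_F \\
&\ge (1 - \delta)\|Y\|_F - (1 - \delta)\|(P_{U_1} - P_{U_2})(Y)\|_F - \|\mathcal{A}\|\,\|P_{U_1} - P_{U_2}\|\,\|Y\|_F \\
&\ge \left[ 1 - \delta - (\|\mathcal{A}\| + 1)\|P_{U_1} - P_{U_2}\| \right] \|Y\|_F,
\end{aligned} \qquad (4.15)$$

which completes the proof. ■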

To apply these concentration results to low-rank matrices, we characterize the set of all matrices
of rank at most $r$ as a union of subspaces. Let $V \subset \mathbb{R}^m$ and $W \subset \mathbb{R}^n$ be fixed subspaces of dimension
$r$. Then the set of all $m \times n$ matrices $X$ whose row space is contained in $W$ and column space
is contained in $V$ forms an $r^2$-dimensional subspace of matrices of rank less than or equal to $r$.
Denote this subspace as $\Sigma(V, W) \subset \mathbb{R}^{m \times n}$. Any matrix of rank less than or equal to $r$ is an element
of some $\Sigma(V, W)$ for a suitable pair of subspaces, i.e., the set

$$\Sigma_{mnr} := \{\Sigma(V, W) : V \in \mathfrak{G}(m, r),\ W \in \mathfrak{G}(n, r)\}.$$

We now characterize how many subspaces are necessary to cover this set to arbitrary resolution. The
covering number $\mathfrak{N}(\epsilon)$ of $\Sigma_{mnr}$ at resolution $\epsilon$ is defined to be the smallest number of subspaces
$(V_i, W_i)$ such that for any pair of subspaces $(V, W)$, there is an $i$ with $\rho(\Sigma(V, W), \Sigma(V_i, W_i)) \le \epsilon$. That is, the covering number is the smallest cardinality of an $\epsilon$-net. The following lemma
characterizes the cardinality of such a set.
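Lemma 4.5 The covering number $\mathfrak{N}(\epsilon)$ of $\Sigma_{mnr}$ is bounded above by

$$\left( \frac{2 C_0}{\epsilon} \right)^{r(m + n - 2r)} \qquad (4.16)$$

where $C_0$ is a constant independent of $\epsilon$, $m$, $n$, and $r$.

Proof Note that the projection operator onto $\Sigma(V, W)$ can be written as $P_{\Sigma(V,W)} = P_V \otimes P_W$,
so for a pair of subspaces $(V_1, W_1)$ and $(V_2, W_2)$, we have

$$\begin{aligned}
\rho(\Sigma(V_1, W_1), \Sigma(V_2, W_2)) &= \|P_{\Sigma(V_1, W_1)} - P_{\Sigma(V_2, W_2)}\| \\
&= \|P_{V_1} \otimes P_{W_1} - P_{V_2} \otimes P_{W_2}\| \\
&= \|(P_{V_1} - P_{V_2}) \otimes P_{W_1} + P_{V_2} \otimes (P_{W_1} - P_{W_2})\| \\
&\le \|P_{V_1} - P_{V_2}\|\,\|P_{W_1}\| + \|P_{V_2}\|\,\|P_{W_1} - P_{W_2}\| \\
&= \rho(V_1, V_2) + \rho(W_1, W_2).
\end{aligned} \qquad (4.17)$$

The conditions $\rho(V_1, V_2) \le \epsilon/2$ and $\rho(W_1, W_2) \le \epsilon/2$ together imply that $\rho(\Sigma(V_1, W_1), \Sigma(V_2, W_2)) \le \rho(V_1, V_2) + \rho(W_1, W_2) \le \epsilon$. Let $V_1, \ldots, V_{N_1}$ cover the set of $r$-dimensional subspaces of $\mathbb{R}^m$ to
resolution $\epsilon/2$ and $W_1, \ldots, W_{N_2}$ cover the $r$-dimensional subspaces of $\mathbb{R}^n$ to resolution $\epsilon/2$. Then
for any $(V, W)$, there exist $i$ and $j$ such that $\rho(V, V_i) \le \epsilon/2$ and $\rho(W, W_j) \le \epsilon/2$. Therefore,
$\mathfrak{N}(\epsilon) \le N_1 N_2$. By the work of Szarek on $\epsilon$-nets of the Grassmannian ([32], [50, Th. 8]) there is a
universal constant $C_0$, independent of $m$, $n$, and $r$, such that

$$N_1 \le \left( \frac{2 C_0}{\epsilon} \right)^{r(m - r)} \quad \text{and} \quad N_2 \le \left( \frac{2 C_0}{\epsilon} \right)^{r(n - r)}, \qquad (4.18)$$

which completes the proof. ■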

The exact value of the universal constant $C_0$ is not provided by Szarek in [50]. It takes the
same value for any homogeneous space whose automorphism group is a subgroup of the orthogonal
group, and is independent of the dimension of the homogeneous space. Hence, one might expect
this constant to be quite large. However, it is known that for the sphere $C_0 \le 3$ [35], and there is
no indication that this constant is not similarly small for the Grassmannian.

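We now proceed to the proof of the main result in this section. For this, we use a union bound
to combine the probabilistic guarantees of Lemma 4.3 with the estimates of the covering number
of $\Sigma_{mnr}$.

Proof [of Theorem 4.2]
Let $\Omega = \{(V_i, W_i)\}$ be a finite set of subspaces that satisfies the conditions of Lemma 4.5 for
$\epsilon > 0$, so $|\Omega| \le \mathfrak{N}(\epsilon)$. For each pair $(V_i, W_i)$, define the set of matrices

$$\mathcal{B}_i := \left\{ X \;\middle|\; \exists\, (V, W) \text{ such that } X \in \Sigma(V, W) \text{ and } \rho(\Sigma(V, W), \Sigma(V_i, W_i)) \le \epsilon \right\}. \qquad (4.19)$$

Since $\Omega$ is an $\epsilon$-net, we have that the union of all the $\mathcal{B}_i$ is equal to $\Sigma_{mnr}$. Therefore, if for all $i$,
$(1 - \delta)\|X\|_F \le \|\mathcal{A}(X)\| \le (1 + \delta)\|X\|_F$ for all $X \in \mathcal{B}_i$, we must have that $\delta_r(\mathcal{A}) \le \delta$, proving that

$$\begin{aligned}
\mathbb{P}(\delta_r(\mathcal{A}) \le \delta) &= \mathbb{P}\left[ (1 - \delta)\|X\|_F \le \|\mathcal{A}(X)\| \le (1 + \delta)\|X\|_F \ \ \forall\, X \text{ s.t. } \mathrm{rank}(X) \le r \right] \\
&\ge \mathbb{P}\left[ \forall\, i \ \ (1 - \delta)\|X\|_F \le \|\mathcal{A}(X)\| \le (1 + \delta)\|X\|_F \ \ \forall\, X \in \mathcal{B}_i \right].
\end{aligned} \qquad (4.20)$$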
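Now note that if we have $(1 + \|\mathcal{A}\|)\epsilon \le \delta/2$ and, for all $Y \in \Sigma(V_i, W_i)$, $(1 - \delta/2)\|Y\|_F \le \|\mathcal{A}(Y)\| \le (1 + \delta/2)\|Y\|_F$, Lemma 4.4 implies that $(1 - \delta)\|X\|_F \le \|\mathcal{A}(X)\| \le (1 + \delta)\|X\|_F$ for all $X \in \mathcal{B}_i$.
Therefore, using a union bound, (4.20) is greater than or equal to

$$1 - \sum_{i=1}^{|\Omega|} \mathbb{P}\left[ \exists\, Y \in \Sigma(V_i, W_i) : \|\mathcal{A}(Y)\| < \left(1 - \tfrac{\delta}{2}\right)\|Y\|_F \text{ or } \|\mathcal{A}(Y)\| > \left(1 + \tfrac{\delta}{2}\right)\|Y\|_F \right] - \mathbb{P}\left[ \|\mathcal{A}\| \ge \frac{\delta}{2\epsilon} - 1 \right]. \qquad (4.21)$$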



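We can bound these quantities separately. First we have by Lemmas 4.3 and 4.5

$$\begin{aligned}
\sum_{i=1}^{|\Omega|} \mathbb{P}&\left[ \exists\, Y \in \Sigma(V_i, W_i) : \|\mathcal{A}(Y)\| < \left(1 - \tfrac{\delta}{2}\right)\|Y\|_F \text{ or } \|\mathcal{A}(Y)\| > \left(1 + \tfrac{\delta}{2}\right)\|Y\|_F \right] \\
&\le 2\,\mathfrak{N}(\epsilon) \left( \frac{24}{\delta} \right)^{r^2} \exp\left( -\frac{p}{2}\left( \delta^2/32 - \delta^3/96 \right) \right) \\
&\le 2 \left( \frac{2 C_0}{\epsilon} \right)^{r(m + n - 2r)} \left( \frac{24}{\delta} \right)^{r^2} \exp\left( -\frac{p}{2}\left( \delta^2/32 - \delta^3/96 \right) \right).
\end{aligned} \qquad (4.22)$$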



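Secondly, since $\mathcal{A}$ is nearly isometric, there exists a constant $\gamma$ such that

$$\mathbb{P}\left( \|\mathcal{A}\| \ge 1 + \sqrt{\frac{mn}{p}} + t \right) \le \exp(-\gamma p t^2). \qquad (4.23)$$

In particular,

$$\mathbb{P}\left( \|\mathcal{A}\| \ge \frac{\delta}{2\epsilon} - 1 \right) \le \exp\left( -\gamma p \left( \frac{\delta}{2\epsilon} - \sqrt{\frac{mn}{p}} - 2 \right)^2 \right). \qquad (4.24)$$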



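We now must pick a suitable resolution $\epsilon$ to guarantee that this probability is less than $\exp(-c_1 p)$
for a suitably chosen constant $c_1$. First note that if we choose $\epsilon < (\delta/4)(\sqrt{mn/p} + 1)^{-1}$,

$$\mathbb{P}\left( \|\mathcal{A}\| \ge \frac{\delta}{2\epsilon} - 1 \right) \le \exp(-\gamma m n), \qquad (4.25)$$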



which achieves the desired scaling because $mn > p$. With this choice of $\epsilon$, the quantity in Equation (4.22) is less than or equal to

$$\begin{aligned}
2 &\left( \frac{8 C_0 (\sqrt{mn/p} + 1)}{\delta} \right)^{r(m + n - 2r)} \left( \frac{24}{\delta} \right)^{r^2} \exp\left( -\frac{p}{2}\left( \delta^2/32 - \delta^3/96 \right) \right) \\
&= 2 \exp\left( -p\,a(\delta) + r(m + n - 2r) \log\left( \sqrt{\frac{mn}{p}} + 1 \right) + r(m + n - 2r) \log\left( \frac{8 C_0}{\delta} \right) + r^2 \log\left( \frac{24}{\delta} \right) \right)
\end{aligned} \qquad (4.26)$$

where $a(\delta) = \delta^2/64 - \delta^3/192$. Since $mn/p < mn$ for all $p > 1$, there exists a constant $c_0$ independent
of $m$, $n$, $p$, and $r$, such that the sum of the last three terms in the exponent is bounded above by
$(c_0 / a(\delta))\, r(m + n) \log(mn)$. It follows that there exists a constant $c_1$ independent of $m$, $n$, $p$, and $r$
such that $p \ge c_0\, r(m + n) \log(mn)$ observations are sufficient to yield an RIP of $\delta$ with probability
greater than $1 - e^{-c_1 p}$. ■

Heuristically, the scaling $p = O(r(m + n) \log(mn))$ is very reasonable, since a rank $r$ matrix
has $r(m + n - r)$ degrees of freedom. This coarse tail bound only provides asymptotic estimates
for recovery, and is quite conservative in practice. As we demonstrate in Section 6, minimum rank
solutions can be determined from between $2r(m + n - r)$ to $4r(m + n - r)$ observations for many
practical problems.
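To make the counting above concrete, the short sketch below evaluates the degrees of freedom $r(m + n - r)$ and the empirical range of $2\times$ to $4\times$ that number for one set of sizes. The sizes and function names are illustrative assumptions, not values taken from the paper, and the constant $c_0$ of Theorem 4.2 is deliberately left out because it is not specified.

```python
import math

def degrees_of_freedom(m, n, r):
    """Number of free parameters of an m x n matrix of rank r."""
    return r * (m + n - r)

def rip_measurement_scale(m, n, r):
    """The r(m + n) log(mn) scaling of Theorem 4.2, without its constant c_0."""
    return r * (m + n) * math.log(m * n)

m, n, r = 50, 50, 5                      # illustrative sizes
dof = degrees_of_freedom(m, n, r)
print(dof, 2 * dof, 4 * dof, m * n)      # prints 475 950 1900 2500
```

For these sizes the empirical range of 950 to 1900 observations is well below the 2500 entries of the full matrix.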



5 Algorithms for nuclear norm minimization 

A variety of methods can be developed for the effective minimization of the nuclear norm over an 
affine subspace of matrices, and we do not have room for a comprehensive treatment here. Instead, 
we focus on three methods highlighting the trade-offs between computational speed and guarantees 
on accuracy of the resulting solution. Directly solving the semidefinite characterization of the 
nuclear norm problem using primal-dual interior point methods is a numerically efficient method 
for small problems and can be used to yield accuracy up to floating-point precision. 

Since interior point methods use second-order information, the memory requirements for computing descent directions quickly become too large as the problem size increases. Moreover, for
larger problem sizes it is preferable to use methods that exploit, at least partially, the structure of
the problem. This can be done at several levels, either by taking into account further information
that may be available about the linear map $\mathcal{A}$ (e.g., the case of partially observed Fourier measurements) or by formulating algorithms that are specific to the nuclear norm problem. For the
latter, we show how to apply subgradient methods to minimize the nuclear norm over an affine set.
Such first-order methods cannot yield as high numerical precision as interior point methods, but
much larger problems can be solved because no second-order information needs to be stored. For
even larger problems, we discuss a low-rank semidefinite programming method that explicitly works with a
factorization of the decision variable. This method can be applied even when the matrix decision
variable cannot fit into memory, but convergence guarantees are much less satisfactory than in the
other two cases.



5.1 Interior point methods for semidefinite programming

For small problems where a high degree of numerical precision is required, interior point methods
for semidefinite programming can be directly applied to solve affine nuclear norm minimization problems.
As we have seen in earlier sections, the nuclear norm minimization problem can be directly posed as
a semidefinite programming problem via the standard form primal-dual pair (2.8). As written, the
primal problem has one $(n + m) \times (n + m)$ semidefinite constraint and $p$ affine constraints. Conversely,
the dual problem has one $(n + m) \times (n + m)$ semidefinite constraint and $p$ scalar decision variables.
Thus, the total number of decision variables (primal and dual) is equal to $\binom{n + m + 1}{2} + p$.

Modern interior point solvers for semidefinite programming generally use primal-dual methods, 
and compute an update direction for the current solution by solving a suitable Newton system. 
Depending on the structure of the linear mapping A, this may entail solving a potentially large, 
dense linear system. 
If the matrix dimensions $n$ and $m$ are not too large, then any good interior point SDP solver,
such as SeDuMi [48] or SDPT3 [51], will quickly produce accurate solutions. In fact, as we will see
in the next section, problems with $n$ and $m$ around 50 can be solved to machine precision in minutes
on a desktop computer. However, solving such a primal-dual pair of programs
with traditional interior point methods can prove to be quite challenging when the dimensions of
the matrix $X$ are much bigger than $100 \times 100$, since in this case the corresponding Newton systems
become quite large. In the absence of any specific additional structure, the memory requirements
of such dense systems quickly limit the size of problems that can be solved.

Perhaps the most important drawback of the direct SDP approach is that it completely ignores 
the possibility of efficiently computing the nuclear norm via a singular value decomposition, instead 
of the less efficient eigenvalue decomposition of a bigger matrix. The method we discuss next will 
circumvent this obstacle, by directly working with subgradients of the nuclear norm. 



5.2 Projected subgradient methods 



The nuclear norm minimization (3.1) is a linearly constrained nondifferentiable convex problem.
There are numerous techniques to approach this kind of problem, depending on the specific nature of the constraints (e.g., dense vs. sparse), and the possibility of using first- or second-order
information.

In this section we describe a simple, easy to implement, subgradient projection approach to the
solution of (3.1). This first-order method will proceed by computing a sequence of feasible points
$\{X_k\}$, with iterates satisfying the update rule

$$X_{k+1} = \Pi(X_k - s_k Y_k), \qquad Y_k \in \partial\|X_k\|_*,$$

where $\Pi$ is the orthogonal projection onto the affine subspace defined by the linear constraints
$\mathcal{A}(X) = b$, and $s_k > 0$ is a stepsize parameter. In other words, the method updates the current
iterate $X_k$ by taking a step along the direction of a subgradient at the current point and then
projecting back onto the feasible set. Alternatively, since $X_k$ is feasible, we can rewrite this as

$$X_{k+1} = X_k - s_k \Pi_{\mathcal{A}} Y_k,$$

where $\Pi_{\mathcal{A}}$ is the orthogonal projection onto the kernel of $\mathcal{A}$. Since the feasible set is an affine
subspace, there are several options for the projection $\Pi_{\mathcal{A}}$. For small problems, one can precompute
it using, for example, a QR decomposition of the matrix representation of $\mathcal{A}$ and store it. Alternatively, one can solve a least squares problem at each step by iterative methods such as conjugate
gradients.

The subgradient-based method described above is extremely simple to implement, since only a
subgradient evaluation is required at every step. The computation of the subgradient can be done
using the formula given in (2.9) earlier, thus requiring only a singular value decomposition of the
current point $X_k$.
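A minimal sketch of the projected subgradient iteration is given below. It is an illustration under several assumptions that are not fixed by the text: the map is stored as a dense $p \times mn$ matrix acting on column-stacked vectors, the kernel projector is precomputed with a pseudoinverse (rather than a QR factorization), the subgradient used is $UV'$ from a compact SVD, and the stepsize is $s_k = 1/k$. The function names are hypothetical.

```python
import numpy as np

def nuclear_subgradient(X, tol=1e-10):
    """One element of the subdifferential: U V' from a compact SVD of X."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    r = int(np.sum(s > tol))
    return U[:, :r] @ Vt[:r, :]

def projected_subgradient(A, b, shape, iters=300):
    """Approximately minimize ||X||_* subject to A vec(X) = b (vec stacks columns)."""
    A_pinv = np.linalg.pinv(A)
    P_ker = np.eye(A.shape[1]) - A_pinv @ A          # orthogonal projector onto ker(A)
    x = A_pinv @ b                                   # a feasible starting point
    best_x = x.copy()
    best_val = np.linalg.norm(x.reshape(shape, order="F"), "nuc")
    for k in range(1, iters + 1):
        Y = nuclear_subgradient(x.reshape(shape, order="F"))
        x = x - (1.0 / k) * (P_ker @ Y.flatten(order="F"))   # diminishing step s_k = 1/k
        val = np.linalg.norm(x.reshape(shape, order="F"), "nuc")
        if val < best_val:                           # the cost need not decrease every step
            best_x, best_val = x.copy(), val
    return best_x.reshape(shape, order="F"), best_val

rng = np.random.default_rng(0)
m, n, p = 6, 6, 25
X0 = np.outer(rng.normal(size=m), rng.normal(size=n))        # rank-one target
A = rng.normal(size=(p, m * n)) / np.sqrt(p)
b = A @ X0.flatten(order="F")
X_hat, val = projected_subgradient(A, b, (m, n))
print(val, np.linalg.norm(X0, "nuc"))
```

Because the method is not a descent method, the sketch keeps track of the best iterate seen so far instead of returning the last one.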

A possible alternative here to the use of the SVD for the subgradient computation is to directly
focus on the "angular" factor of the polar decomposition of $X_k$, using for instance the Newton-like
methods developed by Gander in [30]. Specifically, for a given matrix $X_k$, the Halley-like iteration

$$X \leftarrow X (X'X + 3I)(3X'X + I)^{-1}$$
converges globally and quadratically to the polar factor of $X$, and thus yields an element of the
subdifferential of the nuclear norm. This iteration method (suitably scaled) can be faster than a
direct SVD computation, particularly if the singular values of the initial matrix are close to 1. This
could be appealing since presumably only a very small number of iterations would be needed to
update the polar factor of $X_k$, although the nonsmoothness of the subdifferential is bound to cause
some additional difficulties.
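The Halley-like iteration can be sketched in a few lines. The version below is an illustrative implementation, not the scaled variant from the cited work: it assumes a fixed iteration count and a simple scaling by the spectral norm so that all singular values start in $(0, 1]$; the function name is hypothetical.

```python
import numpy as np

def polar_factor_halley(X, iters=30):
    """Approximate the polar factor U V' of X with X <- X (X'X + 3I)(3X'X + I)^{-1}."""
    Z = X / np.linalg.norm(X, 2)          # scale so that singular values lie in (0, 1]
    I = np.eye(X.shape[1])
    for _ in range(iters):
        G = Z.T @ Z
        Z = Z @ (G + 3.0 * I) @ np.linalg.inv(3.0 * G + I)
    return Z

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 4))
U, s, Vt = np.linalg.svd(X, full_matrices=False)
print(np.linalg.norm(polar_factor_halley(X) - U @ Vt))   # should be close to zero
```

Each singular value $\sigma$ is mapped to $\sigma(\sigma^2 + 3)/(3\sigma^2 + 1)$, which has a fixed point at 1, so the iterates approach $UV'$ while the singular vectors are left unchanged.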

Regarding convergence, for general nonsmooth problems, subgradient methods do not guarantee
a decrease of the cost function at every iteration, even for arbitrarily small step sizes (see, e.g.,
[2, §6.3.1]), unless the minimum-norm subgradient is used. Instead, convergence is usually shown
through the decrease (for small stepsize) of the distance from the iterates $X_k$ to any optimal
point. There are several possibilities for the choice of stepsize $s_k$. The simplest choice that can
guarantee convergence is to use a diminishing stepsize with an infinite travel condition (i.e., such
that $\lim_{k \to \infty} s_k = 0$ and $\sum_{k > 0} s_k$ diverging).

Oftentimes, even the computation of a singular value decomposition or Halley-like iteration can
be too computationally expensive. The next section proposes a reduction of the size of the search 
space to alleviate such demands. We must give up guarantees of convergence for this convenience, 
but this may be an acceptable trade-off for very large-scale problems. 



5.3 Low-rank parametrization 

We now turn to a method that works with an explicit low-rank factorization of X. This algorithm 
not only requires less storage capacity and computational overhead than the previous methods, 
but for many problems does not even require one to be able to store the decision variable X in 
memory. This is the case, for example, in the matrix completion problem where A(X) is a subset 
of the entries of X. 

Given observations of the form $\mathcal{A}(X) = b$ of an $m \times n$ matrix $X$ of rank $r$, a possible search
algorithm to find a suitable $X$ would be to find a factorization $X = LR'$, where $L$ is an $m \times r$
matrix and $R$ an $n \times r$ matrix, such that the equality constraints are satisfied. Since there are many
possible such factorizations, we could search for one where the matrices $L$ and $R$ have Frobenius
norm as small as possible, i.e., the solution of the optimization problem

$$\begin{aligned}
\min \quad & \tfrac{1}{2}\left( \|L\|_F^2 + \|R\|_F^2 \right) \\
\text{s.t.} \quad & \mathcal{A}(LR') = b.
\end{aligned} \qquad (5.1)$$

Even though the cost function is convex, the constraint is not. Such a problem is a nonconvex
quadratic program, and it is not evidently easy to optimize. We show below that the minimization
of the nuclear norm subject to equality constraints is in fact equivalent to this rather natural
heuristic optimization, as long as $r$ is chosen to be sufficiently larger than the rank of the optimum
of the nuclear norm problem.



Lemma 5.1 Assume $r \ge \mathrm{rank}(X_0)$. The nonconvex quadratic optimization problem (5.1) is equivalent to the minimum nuclear norm relaxation (3.1).

Proof Consider any feasible solution $(L, R)$ of (5.1). Then, defining $W_1 := LL'$, $W_2 := RR'$, and
$X := LR'$ yields a feasible solution of the primal SDP problem (2.8) that achieves the same cost.
Since the SDP formulation is equivalent to the nuclear norm problem, we have that the optimal
value of (5.1) is always greater than or equal to the nuclear norm heuristic.

For the converse, we can use an argument similar to the proof of Proposition 2.1. From the
SVD decomposition $X^* = U \Sigma V'$ of the optimal solution of the nuclear norm relaxation (3.1), we
can explicitly construct matrices $L := U \Sigma^{1/2}$ and $R := V \Sigma^{1/2}$ for (5.1) that yield exactly the same
value of the objective. ■
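The construction in the second half of the proof can be checked numerically. The sketch below is illustrative (the helper name `balanced_factors` and the test sizes are assumptions): it forms $L = U \Sigma^{1/2}$ and $R = V \Sigma^{1/2}$ from a truncated SVD and confirms that $LR'$ reproduces the matrix and that the cost in (5.1) equals its nuclear norm.

```python
import numpy as np

def balanced_factors(X, r):
    """Return L = U S^(1/2) and R = V S^(1/2) from the top-r part of the SVD of X."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    root = np.sqrt(s[:r])
    return U[:, :r] * root, Vt[:r, :].T * root

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2)) @ rng.normal(size=(2, 5))   # rank-2 matrix
L, R = balanced_factors(X, 3)                            # r = 3 >= rank(X)
cost = 0.5 * (np.sum(L ** 2) + np.sum(R ** 2))
print(cost, np.linalg.norm(X, "nuc"))                    # the two values should agree
```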

The main advantage of this reformulation is to substantially decrease the number of primal
decision variables from $nm$ to $(n + m)r$. For large problems, this is quite a significant reduction
that allows us to search for matrices of rank smaller than the order of 100, and $n + m$ in the hundreds
of thousands on a desktop computer. However, this problem is nonconvex and potentially subject
to local minima. This is not as much of a problem as it could be, for two reasons. First recall from
Theorem 3.2 that if $\delta_{2r}(\mathcal{A}) < 1$, there is a unique $X^*$ with rank at most $r$ such that $\mathcal{A}(X^*) = b$.
Since any local minimum $(L^*, R^*)$ of (5.1) is feasible, we would have $X^* = L^* R^{*\prime}$ and we would have
found the minimum rank solution. Second, we now present an algorithm that is guaranteed to
converge to a local minimum for a judiciously selected $r$. We will also provide a sufficient condition
for when we can construct an optimal solution of (2.8) from the solution computed by the method
of multipliers.



SDPLR and the method of multipliers For general semidefinite programming problems,
Burer and Monteiro have developed in [SJ |S| a nonlinear programming approach that relies on a
low-rank factorization of the matrix decision variable. We will adapt this idea to our problem,
to provide a first-order Lagrangian minimization algorithm that efficiently finds a local minimum
of (5.1). As a consequence of the work in jS], it will follow that for values of $r$ larger than the
rank of the true optimal solution, the local minima of (5.1) can be transformed into global minima
of (2.8) under the identification $W_1 = LL'$, $W_2 = RR'$ and $Y = LR'$. We summarize below the
details of this approach.

The algorithm employed is called the method of multipliers, a standard approach for solving
equality constrained optimization problems pa]. The method of multipliers works with an augmented
Lagrangian for (5.1)

$$\mathcal{L}_a(L, R; y, \sigma) := \tfrac{1}{2}\left( \|L\|_F^2 + \|R\|_F^2 \right) - y'(\mathcal{A}(LR') - b) + \tfrac{\sigma}{2}\|\mathcal{A}(LR') - b\|^2, \qquad (5.2)$$

where the $y_i$ are arbitrarily signed Lagrange multipliers and $\sigma$ is a positive constant. A somewhat
similar algorithm was proposed by Rennie et al. in [42] in the context of collaborative filtering. In this work,
the authors minimize $\mathcal{L}_a$ with $\sigma$ fixed and $y = 0$ to serve as a regularized algorithm for matrix
completion. Remarkably, by deterministically varying $\sigma$ and $y$, this method can be adapted into
an algorithm for solving linearly constrained nuclear-norm minimization.

In the method of multipliers, one alternately minimizes the augmented Lagrangian with respect
to the decision variables L and R, and then increases the value of the penalty coefficient $\sigma$ and
updates y. The augmented Lagrangian can be minimized using any local search technique, and the
partial derivatives are particularly simple to compute. Let $\hat{y} := y - \sigma(\mathcal{A}(LR') - b)$. Then we have

$$\begin{aligned} \nabla_L \mathcal{L}_a &= L - \mathcal{A}^*(\hat{y})R \\ \nabla_R \mathcal{L}_a &= R - \mathcal{A}^*(\hat{y})'L. \end{aligned}$$

To calculate the gradients, we first compute the constraint violations $\mathcal{A}(LR') - b$, then form $\hat{y}$, and
finally use the above equations to compute the gradients.
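
As a minimal sketch of this computation, assume the linear map is stored as a dense p × mn matrix `Amat` acting on the row-major vectorization of LR'. That storage choice and the function names are illustrative assumptions, not something the text prescribes. Under that assumption the augmented Lagrangian (5.2) and its partial derivatives can be written in NumPy as:

```python
import numpy as np

def aug_lagrangian(L, R, y, sigma, Amat, b):
    """Value of the augmented Lagrangian (5.2) for factors L (m x r), R (n x r)."""
    resid = Amat @ (L @ R.T).ravel() - b          # constraint violations A(LR') - b
    return (0.5 * (np.sum(L**2) + np.sum(R**2))
            - y @ resid
            + 0.5 * sigma * (resid @ resid))

def gradients(L, R, y, sigma, Amat, b):
    """Partial derivatives of (5.2) with respect to L and R."""
    m, n = L.shape[0], R.shape[0]
    resid = Amat @ (L @ R.T).ravel() - b          # step 1: constraint violations
    y_hat = y - sigma * resid                     # step 2: form y_hat
    adj = (Amat.T @ y_hat).reshape(m, n)          # adjoint A*(y_hat) as an m x n matrix
    return L - adj @ R, R - adj.T @ L             # step 3: the two gradient blocks
```

Comparing one entry of `gradients` against a central finite difference of `aug_lagrangian` is a cheap way to confirm that the sign conventions for y and $\sigma$ agree.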

As the number of iterations tends to infinity, only feasible points will have finite values of $\mathcal{L}_a$,
and for any feasible point, $\mathcal{L}_a(L, R)$ is equal to the original cost function $(\|L\|_F^2 + \|R\|_F^2)/2$. The
method terminates when L and R are feasible, as in this case the Lagrangian is stationary and
we are at a local minimum of (5.1). Including the y multipliers improves the conditioning of each
subproblem where $\mathcal{L}_a$ is minimized and enhances the rate of convergence. The following theorem
shows that when the method of multipliers converges, it converges to a local minimum of (5.1).



Theorem 5.2 Suppose we have a sequence $(L^{(k)}, R^{(k)}, y^{(k)})$ of local minima of the augmented
Lagrangian at each step of the method of multipliers. Assume that $\sigma^{(k)} \to \infty$ and that
the sequence $y^{(k)}$ is bounded. If $(L^{(k)}, R^{(k)})$ converges to $(L^*, R^*)$ and the linear map

$$\Lambda^{(k)}(y) := \begin{bmatrix} \mathcal{A}^*(y) R^{(k)} \\ \mathcal{A}^*(y)' L^{(k)} \end{bmatrix} \qquad (5.3)$$

has kernel equal to the zero vector for all k, then there exists a vector $y^*$ such that

(i) $\nabla \mathcal{L}_a(L^*, R^*; y^*) = 0$

(ii) $\mathcal{A}(L^* R^{*\prime}) = b$

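Proof This proof is standard and follows the approach in [6]. As above, we define
$\hat{y}^{(k)} := y^{(k)} - \sigma^{(k)}(\mathcal{A}(L^{(k)} R^{(k)\prime}) - b)$ for all k. Since $(L^{(k)}, R^{(k)})$ minimize the augmented Lagrangian at
iteration k, we have

$$\begin{aligned} 0 &= L^{(k)} - \mathcal{A}^*(\hat{y}^{(k)}) R^{(k)} \\ 0 &= R^{(k)} - \mathcal{A}^*(\hat{y}^{(k)})' L^{(k)}, \end{aligned} \qquad (5.4)$$

which we may rewrite as

$$\Lambda^{(k)}(\hat{y}^{(k)}) = \begin{bmatrix} L^{(k)} \\ R^{(k)} \end{bmatrix}. \qquad (5.5)$$

Since we have assumed that there is no non-zero y with $\Lambda^{(k)}(y) = 0$, there exists a left-inverse and
we can solve for $\hat{y}^{(k)}$:

$$\hat{y}^{(k)} = \Lambda^{(k)\dagger}\left( \begin{bmatrix} L^{(k)} \\ R^{(k)} \end{bmatrix} \right). \qquad (5.6)$$

Everything on the right-hand side is bounded, and $L^{(k)}$ and $R^{(k)}$ converge. Therefore, we must
have that $\hat{y}^{(k)}$ converges to some $y^*$. Taking the limit of (5.4) proves (i). To prove (ii), note that
$\hat{y}^{(k)}$ must be bounded. Since $y^{(k)}$ is also bounded, we find that $\sigma^{(k)}(\mathcal{A}(L^{(k)} R^{(k)\prime}) - b)$ is
also bounded. But $\sigma^{(k)} \to \infty$ implies that $\mathcal{A}(L^* R^{*\prime}) = b$, completing the proof. ■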
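
The iteration analysed in Theorem 5.2 can be sketched as follows. This is an illustrative outline under stated assumptions, not the implementation used in the paper. The linear map is assumed to be a dense p × mn matrix `Amat` acting on the row-major vectorization, each inner minimization uses SciPy's L-BFGS-B as the "local search technique", and the penalty schedule (`sigma` doubled on each of 15 outer steps) is an arbitrary choice.

```python
import numpy as np
from scipy.optimize import minimize

def method_of_multipliers(Amat, b, m, n, r, outer=15, sigma=1.0, growth=2.0, seed=0):
    """Sketch: alternate inner minimization of (5.2) with updates of y and sigma."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((m + n) * r)     # stacked entries of L (m x r) and R (n x r)
    y = np.zeros(len(b))                     # initial multipliers

    def split(z):
        return z[:m * r].reshape(m, r), z[m * r:].reshape(n, r)

    def value_and_grad(z, y, sigma):
        L, R = split(z)
        resid = Amat @ (L @ R.T).ravel() - b
        y_hat = y - sigma * resid
        adj = (Amat.T @ y_hat).reshape(m, n)  # A*(y_hat)
        val = 0.5 * (z @ z) - y @ resid + 0.5 * sigma * (resid @ resid)
        grad = np.concatenate([(L - adj @ R).ravel(), (R - adj.T @ L).ravel()])
        return val, grad

    for _ in range(outer):
        # Inner step: local minimization of the augmented Lagrangian, warm-started.
        z = minimize(value_and_grad, z, args=(y, sigma), jac=True, method="L-BFGS-B").x
        L, R = split(z)
        # Outer step: y <- y - sigma (A(LR') - b), then increase the penalty.
        y = y - sigma * (Amat @ (L @ R.T).ravel() - b)
        sigma *= growth
    return L, R, y
```

In practice one would stop once the residual norm of A(LR') - b falls below a tolerance instead of running a fixed number of outer steps.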

Suppose the decision variables are chosen to be of size $m \times r_d$ and $n \times r_d$. A necessary condition
for $\Lambda^{(k)}(y)$ to be full rank is for the number of decision variables $r_d(m + n)$ to be greater than the
number of equalities p. In particular, this means that we must choose $r_d > p/(m + n)$ in order to
have any hope of satisfying the conditions of Theorem 5.2.
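
As a small worked instance of this bound, the helper below (a hypothetical name) returns the smallest integer factor width satisfying the strict inequality. The example values, a 46 × 81 matrix with p = 1200 equalities, are chosen only for illustration.

```python
def min_factor_width(p, m, n):
    """Smallest integer r_d with r_d * (m + n) > p."""
    return p // (m + n) + 1

# 1200 / (46 + 81) is roughly 9.45, so at least 10 columns are needed in L and R.
r_d = min_factor_width(1200, 46, 81)
```

This is only the necessary counting condition. It does not by itself guarantee that the map in (5.3) has a trivial kernel.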

Figure 1: The MIT logo image. The associated matrix has dimensions 46 × 81 and has rank 5.



We close this section by relating the solution found by the method of multipliers to the optimal
solution of the nuclear norm minimization problem. We have already shown that when the low-rank
algorithm converges, it converges to a low-rank solution of A(X) = b. If we additionally
find that $\mathcal{A}^*(y^*)$ has norm less than or equal to one, then $y^*$ is dual feasible. One can check
using straightforward algebra that $(L^* R^{*\prime}, L^* L^{*\prime}, R^* R^{*\prime})$ and $y^*$ form an optimal primal-dual pair
for (2.8). This analysis proves the following theorem.



Theorem 5.3 Let $(L^*, R^*, y^*)$ satisfy (i)-(ii) in Theorem 5.2 and suppose $\|\mathcal{A}^*(y^*)\| \le 1$. Then
$(L^* R^{*\prime}, L^* L^{*\prime}, R^* R^{*\prime})$ is an optimal primal solution and $y^*$ is an optimal dual solution of (2.8).
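
The norm condition in Theorem 5.3 is straightforward to test numerically once the multipliers are available. The sketch below assumes, as an illustration only, that the linear map is a dense p × mn matrix `Amat` acting on the row-major vectorization, so that the adjoint applied to y is `Amat.T @ y` reshaped to m × n. The function name and the tolerance are hypothetical.

```python
import numpy as np

def is_dual_feasible(Amat, y, m, n, tol=1e-8):
    """Check that the operator (spectral) norm of A*(y) is at most one."""
    adj = (Amat.T @ y).reshape(m, n)          # A*(y) as an m x n matrix
    return np.linalg.norm(adj, 2) <= 1.0 + tol
```

If the check fails, the factorization rank was presumably too small or the iteration has not converged, and Theorem 5.3 gives no optimality certificate.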



6 Numerical Experiments 

To illustrate the scaling of low-rank recovery for a particular matrix M, consider the MIT logo
presented in Figure 1. The image has a total of 46 rows and 81 columns (total 3726 elements), and
3 distinct non-zero numerical values corresponding to the colors white, red, and grey. Since the
logo only has 5 distinct rows, it has rank 5. For each of the ensembles discussed in Section 4, we
sampled measurement matrices with p ranging between 700 and 1500, and solved the semidefinite
program (2.6) using the freely available software SeDuMi [48]. On a 2.0 GHz laptop, each semidefinite
program could be solved in less than four minutes. We chose to use this interior point method
because it yielded the highest accuracy in the shortest amount of time, and we were interested in
characterizing precisely when the nuclear norm heuristic succeeded and failed.

Figure 2 plots the Frobenius norm of the difference between the optimal point of the semidefinite
program and the true image in Figure 1. We observe a sharp transition to perfect recovery near
1200 measurements, which is approximately equal to $2r(m + n - r)$. In Figure 3, we display
the recovered solutions for various values of p under the Gaussian ensemble.
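
The quoted threshold can be checked by direct arithmetic with the logo's dimensions (m = 46, n = 81, r = 5). The function name below is hypothetical.

```python
def transition_estimate(r, m, n):
    """Evaluate 2 r (m + n - r), the measurement count quoted for the transition."""
    return 2 * r * (m + n - r)

estimate = transition_estimate(5, 46, 81)   # 2 * 5 * 122 = 1220
fraction = estimate / (46 * 81)             # share of the 3726 matrix entries
```

So the transition occurs at roughly one third as many measurements as there are entries in the image.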

To demonstrate the average behavior of low-rank recovery, we conducted a series of experiments
for a variety of matrix sizes n, ranks r, and numbers of measurements p. For a fixed n, we
constructed random recovery scenarios for low-rank n × n matrices. For each n, we varied p between
0 and $n^2$, where the matrix is completely determined. For a fixed n and p, we generated all possible
ranks such that $r(2n - r) < p$. This cutoff was chosen because beyond that point there would be
an infinite set of matrices of rank r satisfying the p equations.
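
The rank cutoff can be made concrete with a short enumeration. The helper name is hypothetical, and starting the range at r = 1 is an assumption, since the text does not say whether rank zero was included.

```python
def admissible_ranks(n, p):
    """Ranks r in 1..n with r * (2n - r) < p."""
    return [r for r in range(1, n + 1) if r * (2 * n - r) < p]

# For n = 30 and p = 300: r(60 - r) is 275 at r = 5 and 324 at r = 6.
ranks = admissible_ranks(30, 300)
```

The quantity r(2n - r) is the number of degrees of freedom of an n × n matrix of rank r, which is why it is the natural point of comparison for p.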

Figure 2: (a) Error, as measured by the Frobenius norm, between the recovered image and the ground truth.
Observe that there is a sharp transition to near zero error at around 1200 measurements. (b) Zooming in on
this transition, we see fluctuation between high and low error when between 1125 and 1225 measurements
are available.

Figure 3: Example recovered images using the Gaussian ensemble: (a) 700 measurements, (b) 1100
measurements, (c) 1250 measurements. The total number of pixels is 46 × 81 = 3726. Note that the error is
plotted on a logarithmic scale.



For each (n, p, r) triple, we repeated the following procedure 10 times. A matrix of rank r was
generated by choosing two random n × r factors $Y_L$ and $Y_R$ with i.i.d. random entries and setting
$Y_0 = Y_L Y_R'$. A matrix $\mathbf{A}$ was sampled from the Gaussian ensemble with p rows and $n^2$ columns.
Then the nuclear norm minimization

$$\min_X \; \|X\|_* \quad \text{s.t.} \quad \mathbf{A}\,\mathrm{vec}(X) = \mathbf{A}\,\mathrm{vec}(Y_0) \qquad (6.1)$$

was solved using the SDP solver SeDuMi on the formulation (2.6). Again, we chose to use SeDuMi
because we wanted to precisely distinguish between success and failure of the heuristic. We declared
$Y_0$ to be recovered if $\|X - Y_0\|_F / \|Y_0\|_F < 10^{-3}$. Figure 4 shows the results of these experiments
for n = 30 and 40. The color of the cell in the figures reflects the empirical recovery rate of the
10 runs (scaled between 0 and 1). White denotes perfect recovery in all experiments, and black
denotes failure for all experiments.
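
The data generation and the success criterion of this procedure can be sketched in NumPy as below. The nuclear norm solve itself is omitted, since it requires an SDP solver. Two choices here are assumptions made for the sketch and are not stated in this section: the factors have standard normal entries, and the Gaussian ensemble is scaled by $1/\sqrt{p}$. The function names are hypothetical.

```python
import numpy as np

def random_instance(n, r, p, rng):
    """Draw a rank-r target Y0 = Y_L Y_R' and a p x n^2 Gaussian measurement matrix."""
    Y_L = rng.standard_normal((n, r))
    Y_R = rng.standard_normal((n, r))
    Y0 = Y_L @ Y_R.T
    A = rng.standard_normal((p, n * n)) / np.sqrt(p)   # assumed scaling
    b = A @ Y0.ravel()                                 # right-hand side A vec(Y0)
    return A, b, Y0

def is_recovered(X, Y0, tol=1e-3):
    """Relative Frobenius error test used to declare success."""
    return np.linalg.norm(X - Y0, "fro") / np.linalg.norm(Y0, "fro") < tol
```

For p = n^2 the square system has a unique solution almost surely, so `np.linalg.solve(A, b).reshape(n, n)` passes `is_recovered` without any optimization. That gives a quick sanity check of the setup.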

These experiments demonstrate that the logarithmic factors and constants present in our scaling
results are somewhat conservative. For example, as one might expect, low-rank matrices are perfectly
recovered by nuclear norm minimization when $p = n^2$, as the matrix is uniquely determined.
Moreover, as p is reduced slightly away from this value, low-rank matrices are still recovered 100
percent of the time for most values of r. Finally, we note that despite the asymptotic nature of our
analysis, our experiments demonstrate excellent performance with low-rank matrices of size 30 × 30
and 40 × 40, showing that the heuristic is practical even in low-dimensional settings.

Intriguingly, Figure 4 also demonstrates a "phase transition" between perfect recovery and
failure. As observed in several recent papers by Donoho and his collaborators (see, e.g., [18, 19]),
the random sparsity recovery problem has two distinct connected regions of parameter space: one
where the sparsity pattern is perfectly recovered, and one where no sparse solution is found. Not
surprisingly, Figure 4 illustrates an analogous phenomenon in rank recovery. Computing explicit
formulas for the transition between perfect recovery and failure is left for future work.



7 Discussion and future developments 

Having illustrated the natural connections between affine rank minimization and affine cardinality
minimization, we were able to draw on these parallels to determine scenarios where the nuclear
norm heuristic was able to exactly solve the rank minimization problem. These scenarios directly
generalized conditions for which the $\ell_1$ heuristic succeeded and ensembles of linear maps for which
these conditions hold. Furthermore, our experimental results display similar recovery properties
to those demonstrated in the empirical studies of $\ell_1$ minimization. Inspired by the success of
this program, we close this report by briefly discussing several exciting directions that are natural
continuations of this work building on more analogies from the compressed sensing literature. We
also describe possible extensions to more general notions of parsimony.

Figure 4: For each (n, p, r) triple, we repeated the following procedure ten times. A matrix of rank r was
generated by choosing two random n × r factors $Y_L$ and $Y_R$ with i.i.d. random entries and setting $Y_0 = Y_L Y_R'$. We
select a matrix $\mathbf{A}$ from the Gaussian ensemble with p rows and $n^2$ columns. Then we solve the nuclear norm
minimization subject to $\mathbf{A}\,\mathrm{vec}(X) = \mathbf{A}\,\mathrm{vec}(Y_0)$. We declare $Y_0$ to be recovered if $\|X - Y_0\|_F / \|Y_0\|_F < 10^{-3}$.
The results are shown for (a) n = 30 and (b) n = 40. The color of each cell reflects the empirical recovery
rate (scaled between 0 and 1). White denotes perfect recovery in all experiments, and black denotes failure
for all experiments.

Factored measurements and alternative ensembles All of the measurement ensembles require
the storage of O(mnp) numbers. For large problems this is wholly impractical. There are
many promising alternative measurement ensembles that seem to obey the same scaling laws as
those presented in Section 4. For example, "factored" measurements of the form $\mathcal{A}_i : X \mapsto u_i^T X v_i$,
where $u_i$, $v_i$ are Gaussian random vectors, empirically yield the same performance as the Gaussian
ensemble. This factored ensemble only requires storage of O((m + n)p) numbers, which is a rather
significant savings for very large problems. The proof in Section 4 does not seem to extend to
this ensemble, thus new machinery must be developed to guarantee properties about such low-rank
measurements.
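
To make the storage comparison concrete, the sketch below evaluates factored measurements from two stacked arrays `U` (p × m) and `V` (p × n) whose rows hold the vectors $u_i$ and $v_i$. The array layout and the function name are illustrative assumptions.

```python
import numpy as np

def factored_measurements(U, V, X):
    """Return the p-vector with entries u_i' X v_i, where u_i, v_i are rows of U, V."""
    return np.einsum("im,mn,in->i", U, X, V)

rng = np.random.default_rng(0)
m, n, p = 6, 5, 8
U = rng.standard_normal((p, m))
V = rng.standard_normal((p, n))
X = rng.standard_normal((m, n))

# Equivalent dense representation: row i is the vectorized rank-one matrix u_i v_i'.
dense = np.stack([np.outer(U[i], V[i]).ravel() for i in range(p)])
same = np.allclose(dense @ X.ravel(), factored_measurements(U, V, X))
```

Here `U` and `V` together hold (m + n)p numbers, while `dense` holds mnp. The dense form is built only to confirm that the two computations agree.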

Noisy Measurements and Low-rank approximation Our results in this paper address only 
the case of exact (noiseless) measurements. It is of natural interest to understand the behavior of 
the nuclear norm heuristic in the case of noisy data. Based on the existing results for the sparse 
case (e.g., [H]), it would be natural to expect similar stability properties of the recovered solution, 
for instance in terms of the $\ell_2$ norm of the computed solution. Such an analysis could also be used
to study the nuclear norm heuristic as an approximation technique where a matrix has rapidly 
decaying singular values and a low-rank approximation is desired. 

Incoherent Ensembles and Partially Observed Transforms Again, taking our lead from
the compressed sensing literature, it would be of great interest to extend the results of [11] to
low-rank recovery. In this work, the authors show that partially observed unitary transformations
of sparse vectors can be used to recover the sparse vector using $\ell_1$ minimization. There are many
practical applications where low-rank processes are partially observed. For instance, the matrix
completion problem can be thought of as partial observations under the identity transformation.
As another example, there are many settings in two-dimensional Fourier spectroscopy where only
partial information can be observed due to experimental constraints.
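
The remark about matrix completion can be illustrated directly: observing a subset of entries is the same as applying a subset of the rows of the mn × mn identity to vec(X). The sketch below uses a row-major vectorization and hypothetical function names.

```python
import numpy as np

def sample_entries(X, idx):
    """Observe the entries of X at flat (row-major) positions idx."""
    return X.ravel()[idx]

def sample_adjoint(y, idx, shape):
    """Adjoint of sample_entries: place y at positions idx, zeros elsewhere."""
    Z = np.zeros(shape[0] * shape[1])
    Z[idx] = y
    return Z.reshape(shape)

rng = np.random.default_rng(0)
m, n, p = 4, 5, 7
X = rng.standard_normal((m, n))
idx = rng.choice(m * n, size=p, replace=False)   # distinct observed positions

# The same map written as p rows of the identity acting on vec(X).
Amat = np.eye(m * n)[idx]
agrees = np.allclose(Amat @ X.ravel(), sample_entries(X, idx))
```

Such a map needs only the p indices to be stored, but it is far from the Gaussian ensembles analysed in this paper, which is why extending the guarantees to it is listed as an open direction.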

Alternative numerical methods Besides the techniques described in Section 5, there are a
number of interesting additional possibilities to solve the nuclear norm minimization problem. An
appealing suggestion is to combine the strength of second-order methods (as in the SDP approach)
with the known geometry of the nuclear norm (as in the subgradient approach), and develop
a customized interior point method, possibly yielding faster convergence rates, while still being
relatively memory-efficient.

It is also of much interest to investigate the possible adaptation of some of the successful
path-following approaches in traditional $\ell_1$/cardinality minimization, such as the Homotopy [40] or LARS
(least angle regression) [21]. This may not be completely straightforward, since the efficiency of
many of these methods often relies explicitly on the polyhedral structure of the feasible set of the
$\ell_1$ norm problem.

Geometric interpretations For the case of cardinality/$\ell_1$ minimization, a beautiful geometric
interpretation has been set forth by Donoho and Tanner [18, 19]. Key to their results is the notion
of central k-neighborliness of a centrosymmetric polytope, namely the property that every subset
of k + 1 vertices not including an antipodal pair spans a k-face. In particular, they show that the
$\ell_1$ heuristic always succeeds whenever the image of the $\ell_1$ unit ball (the cross-polytope) under the
linear mapping A is a centrally k-neighborly polytope.

In the case of rank minimization, the direct application of these concepts fails, since the unit ball 
of the nuclear norm is not a polyhedral set. Nevertheless, it seems likely that a similar explanation 
could be developed, where the key feature would be the preservation under a linear map of the 
extremality of the components of the boundary of the nuclear norm unit ball defined by low-rank 
conditions. 

Jordan algebras As we have seen, our results for the rank minimization problem closely parallel
the earlier developments in cardinality minimization. A convenient mathematical framework that
allows the simultaneous consideration of these cases, as well as a few new ones, is that of Jordan
algebras and the related symmetric cones [24]. In the Jordan-algebraic setting, there is an intrinsic
notion of rank that agrees with the cardinality of the support in the case of the nonnegative
orthant or the rank of a matrix in the case of the positive semidefinite cone. Besides mathematical
elegance, a direct Jordan-algebraic approach would transparently yield similar results for the case
of second-order (or Lorentz) cone constraints.

As specific examples of the power and elegance of this approach, we mention the work of
Faybusovich [25] and Schmieta and Alizadeh [43], which provide a unified development of interior point
methods for symmetric cones, as well as Faybusovich's work on convexity theorems for quadratic
mappings [26].

Parsimonious models and optimization Sparsity and low-rank are two specific classes of
parsimonious (or low-complexity) descriptions. Are there other kinds of easy-to-describe parametric
models that are amenable to exact solutions via convex optimization techniques? Given the
intimate connections between linear and semidefinite programming and the Jordan algebraic
approaches described earlier, it is likely that this will require alternative tractable convex optimization
formulations.